The Narrative Waltz: The Role of Flexibility in Writing Proficiency


Laura K. Allen, Erica L. Snow, and Danielle S. McNamara 
Arizona State University 


A commonly held belief among educators, researchers, and students is that high-quality texts are easier 
to read than low-quality texts, as they contain more engaging narrative and story-like elements. 
Interestingly, this assumption has typically not been supported by the literature on writing. 
Previous research suggests that higher quality writing is typically associated with decreased levels of text 
narrativity and readability. In this study, the authors present the hypothesis that writing proficiency is 
associated with an individual’s flexible use of linguistic properties, rather than simply the consistent use 
of a particular set of linguistic properties. To test this hypothesis, the authors leveraged both natural 
language processing and dynamic methodologies to capture variability in students’ use of narrative style 
across multiple essay prompts. Forty-five high school students wrote 16 essays across 8 laboratory 
sessions. Natural language processing techniques were first used to calculate the narrativity of each essay. 
Random walk and Euclidian distance measures were then used to visualize and classify students’ 
flexibility in narrativity across essays. The results support the hypotheses that students who were flexible 
in their use of narrativity also wrote essays that were rated as having higher quality, whereas inflexible 
writers tended to write lower quality essays. Additionally, more flexible writers performed higher than 
the more inflexible writers on general assessments of literacy and prior knowledge. These results are 
important for researchers and educators, as they indicate that the link between textual properties and 
writing quality may fluctuate according to the context of a given writing assignment. 


Keywords: writing, flexibility, dynamics, linguistics, individual differences 


The study of writing proficiency typically involves the collec- 
tion of essays that students have written in response to a particular 
topic, and the subsequent scoring of these essays is based on their 
linguistic and rhetorical properties. The score that a student re- 
ceives on this essay is then presumed to serve as a strong proxy for 
their writing proficiency (Attali & Burstein, 2006). Importantly, 
however, this essay scoring process is extremely difficult and 
subjective—even for trained, expert raters—and therefore may not 
fully capture the construct of writing proficiency (Huot, 1990, 
1996; Meadows & Billington, 2005). Accordingly, an important 
area of research regards whether and how writing proficiency can 
be more reliably captured, particularly emphasizing the specific 
characteristics of both the individual writers and the texts they 
produce (Crowhurst, 1990; McNamara, Crossley, & McCarthy, 
2010; Rafoth & Rubin, 1984; Witte & Faigley, 1981). Findings 
from such research can inform our theoretical understanding of the 
writing process (Flower & Hayes, 1981; Hayes, 1996; Kellogg, 
2008; McCutchen, 2000; Swanson & Berninger, 1996), as well as 
the development and automation of writing quality assessments 
(Attali & Burstein, 2006; McNamara, Crossley, & Roscoe, 2013; 
McNamara, Crossley, Roscoe, Allen, & Dai, 2015) and pedagog- 
ical interventions for struggling writers (Roscoe, Varner, Crossley, 
& McNamara, 2013; Shermis & Burstein, 2003). 

This article was published Online First January 18, 2016. 
Laura K. Allen, Erica L. Snow, and Danielle S. McNamara, Learning 
Sciences Institute, Department of Psychology, Arizona State University. 
Correspondence concerning this article should be addressed to Laura K. 
Allen, Learning Sciences Institute, Department of Psychology, Arizona 
State University, P.O. Box 872111, Tempe, AZ 85287-2111. E-mail: 
LauraKAllen@asu.edu 

One assumption that is commonly held among educators, re- 
searchers, and students is that more proficient writers produce texts 
that are easier to comprehend than those of less proficient writers. This 
assumption relies on the notion that narrative text properties, such 
as events, characters, and personal anecdotes, help authors to gain 
the attention of their readers and, subsequently, make texts more 
relatable (Newkirk, 1997, 2012). Indeed, prior research has con- 
firmed that texts with more narrative elements are typically easier 
to comprehend than informational texts (Bruner, 1986; Graesser, 
Olde, & Klettke, 2002; Haberlandt & Graesser, 1985). Addition- 
ally, the degree to which a text is narrative as opposed to infor- 
mative is indicative of its readability across a number of domains 
and grade levels (Graesser, McNamara, & Kulikowich, 2011). 
Interestingly, however, the link between narrativity and essay 
quality has failed to be supported by prior literature. Although 
narrative elements may sometimes be associated with high-quality 
writing (Crossley, Roscoe, & McNamara, 2014), the majority of 
research on essay quality suggests that higher quality writing is 
associated with decreased levels of text narrativity and measures 
of readability in general (Crossley, Weston, McLain Sullivan, & 
McNamara, 2011; McNamara et al., 2013). 

One potential explanation for this conflicting evidence lies in 
the situational influence of narrative text elements on writing 
quality. In other words, it is possible that the frequency of specific 
linguistic or rhetorical text elements alone is not consistently 
indicative of essay quality. Rather, these effects may be largely 
driven by individual differences in students’ ability to leverage the 
benefits of these elements in the appropriate contexts. In this 
article, we hypothesize that writing proficiency is associated with 
an individual’s flexible use of text properties, rather than simply 
the consistent use of a particular set of properties. Some research- 
ers have cited flexibility as a characteristic of strong writers 
(Graham et al., 2012; Graham & Perin, 2007). Graham and Perin 
(2007), for instance, claimed “proficient writers can adapt their 
writing flexibly to the context in which it takes place” (p. 9). 
However, few studies (if any) have empirically tested this claim. In 
the current study, we address this research gap by investigating 
how writing proficiency relates to students’ flexible use of narra- 
tivity across multiple essay prompts. 


Writing Proficiency 


Writing is a complex and demanding activity that requires 
individuals to coordinate a number of cognitive skills and knowl- 
edge sources through the process of setting goals, solving prob- 
lems, and strategically managing their memory resources (Flower 
& Hayes, 1981; Hayes, 1996). Importantly, this writing process 
differs across individuals. Each student brings different strengths 
and weaknesses to a given writing task, and these variables interact 
to affect their unique writing processes, as well as the strategies 
and procedures they utilize to produce effective writing. Individual 
differences can encompass a broad range of characteristics, from 
students’ degree of prior knowledge (e.g., word and content 
knowledge) to their daily and overall affect (e.g., their motivation 
to succeed). Indeed, many models of writing proficiency attempt to 
account for the influence of individual differences among students, 
such as knowledge, skill, and working memory capacity (e.g., 
Kellogg, 2008; McCutchen, 2000; Swanson & Berninger, 1996). 

One important difference between skilled and less skilled writ- 
ers is their level of reading comprehension skill. Reading and 
writing are tightly connected cognitive processes (Allen, Snow, 
Crossley, Jackson, & McNamara, 2014; Fitzgerald & Shanahan, 
2000; Shanahan & Tierney, 1990; Tierney & Shanahan, 1991); 
therefore, students who are better at comprehending texts (as well 
as those who read more frequently) also tend to be better at 
generating high-quality texts. Similarly, writing proficiency can be 
influenced by differences in students’ vocabulary knowledge (Al- 
len, Snow, Crossley et al., 2014; Graham & Perin, 2007). Students 
who have access to a greater number of vocabulary words have a 
greater number of options regarding how they convey ideas. 

Strong writers also differ from weak writers in their knowledge 
of the writing process, including their understanding of writing 
goals and strategies. For example, Saddler and Graham (2007) 
found that less skilled writers demonstrated a weaker understand- 
ing of writing goals (d = -1.13), were less knowledgeable of the 
differences between strong and poor writing (d = -.98), and had 
less knowledge of efficient writing strategies (d = -1.10). Addi- 
tionally, these less skilled writers wrote lower quality and shorter 
essays. 

Finally, individual differences in prior world knowledge may 
influence writing proficiency (McCutchen, 1986; Olinghouse, 
Graham, & Gillespie, 2015). Olinghouse and colleagues (2015), 
for instance, recently examined the role of discourse and topic 
knowledge in the quality and characteristics of fifth grade stu- 
dents’ stories, persuasive essays, and informational text. The re- 
sults of this study suggested that discourse and topic knowledge 
were important elements of young students’ writing skills. Specif- 
ically, they found that each of the two forms of knowledge made 
unique, significant contributions to a prediction of writing quality. 
These results are important, as they indicate that variability in 
knowledge can influence the quality of a written text. This is 
particularly relevant in the context of persuasive essay writing, 
because students who know more about the world can, theoreti- 
cally, develop stronger arguments, as they have greater access to 
supporting examples and evidence. 


Linguistic Properties of High-Quality Writing 


Many of these characteristics of skilled writers (e.g., strong 
reading comprehension skills) are directly related to their 
production of specific linguistic properties in essays (Deane, 
2013). In particular, more sophisticated linguistic text properties 
(e.g., cohesion, complex syntax) are related to higher cognitive 
functioning. Thus, their presence in an essay is indicative of a 
student’s ability to more easily produce complex text, which 
allows them to place a greater focus on higher level rhetorical and 
conceptual text properties (Deane, 2013). To this end, many re- 
searchers have sought to identify the linguistic properties that 
relate to high-quality writing (e.g., Applebee, Langer, Jenkins, 
Mullis, & Foertsch, 1990; Crossley, Roscoe, McNamara, & 
Graesser, 2011; Ferrari, Bouffard, & Rainville, 1998; McNamara 
et al., 2010; Varner, Roscoe, & McNamara, 2013; Witte & Faig- 
ley, 1981). In these studies, trained, expert human raters typically 
score essays based on a standardized rubric (e.g., the SAT rubric). 
The essays are then analyzed for specific linguistic properties, 
either using computational text analysis tools or human coding. 
Finally, statistical techniques (e.g., regression analyses, ANOVAs, 
discriminant function analyses) are employed to determine 
whether there are specific linguistic properties that systematically 
relate to these human judgments of essay quality. 

These previous analyses have provided critical information 
about the linguistic properties of high-quality writing (particularly 
in the context of academic essays; Applebee et al., 1990; Crossley 
et al., 2011; Ferrari et al., 1998; McNamara et al., 2010; Witte & 
Faigley, 1981). For instance, skilled writers tend to produce longer 
essays (Crossley, Weston, et al., 2011; Ferrari et al., 1998; Has- 
well, 2000; McNamara et al., 2010; McNamara et al., 2013) that 
contain fewer spelling and grammar errors (Ferrari et al., 1998). At 
the word level, more proficient writers (i.e., writers who produce 
higher quality essays and writers in higher grades) use longer 
words (Haswell, 2000) that are less frequent, less concrete, and 
more abstract (Crossley, Weston, et al., 2011; McNamara et al., 
2010; McNamara et al., 2013). Similarly, previous research has 
demonstrated that more advanced writers produce essays that 
contain more complex sentence structures (McCutchen et al., 
1994). Haswell (2000), for instance, reported that advanced writers 
produced essays that contained longer sentences and clauses, and 
McNamara and colleagues (2010) reported that higher quality 
essays contained sentences that had a greater number of words 
before the main verb phrase (i.e., more complex sentence struc- 
tures). 

Finally, specific rhetorical and stylistic text properties have been 
associated with higher quality essays. Past studies have found that 
human ratings of essay quality tend to be negatively related to the 
frequency of narrative text properties, but positively related to the 
number of rhetorical structures that focus on contrasted ideas, 
explicitly stated arguments, conditional structures, and reported 
speech (Crossley, Weston, et al., 2011; McNamara et al., 2013). 
Overall, previous research studies reveal that more sophisticated 
writers (defined by both essay scores and higher grade levels) tend 
to produce essays that are longer and contain properties that are 
more indicative of sophisticated lexical, syntactic, and rhetorical 
choices. 


Situational Variability of Writing Quality 


Recently, researchers have noted that the text properties asso- 
ciated with essay quality often vary across different raters, authors, 
assignments, and contexts (e.g., Allen, Snow, & McNamara, 2014; 
Crossley, Allen, & McNamara, 2014; Crossley, Varner, & Mc- 
Namara, 2013; Crossley, Varner, Roscoe, & McNamara, 2013; 
Crossley, Weston, et al., 2011; Varner et al., 2013). Crossley and 
colleagues (2014), for instance, argued that high-quality essays can 
take on a number of different forms—in other words, these essays 
can range quite broadly in their combinations of linguistic prop- 
erties. To investigate this argument, they employed a cluster anal- 
ysis approach for the purpose of identifying multiple linguistic 
profiles of successful essays. Their analysis revealed four distinct 
profiles of successful writers, which were linguistically distinct 
from one another. They argued that these results provided evidence 
that successful writing cannot be simply defined by one set of 
predefined linguistic properties—rather, successful writing can 
manifest in a number of different ways. 

Our hypothesis is that writing proficiency is related (at least in 
part) to students’ sensitivity to these different writing styles and, 
consequently, their ability to flexibly adapt the properties of their 
essays according to the specific context of the writing task. Writing 
proficiency, in other words, is partially characterized by an indi- 
vidual’s ability to assess the context of their writing task and 
flexibly call upon various linguistic tools, given their knowledge of 
the constraints and demands of that surrounding environment. For 
example, if a writer has a strong degree of prior knowledge about 
the topic for a particular writing assignment, they may not need to 
employ narrative, story-like properties in order to persuade the 
reader to take their side on a given argument. On the other hand, 
if a writer is presented with a topic on which they know few explicit 
facts, they might leverage these narrative story elements for the 
purpose of engaging their readers and eliciting emotional reac- 
tions. Writers in both of these examples could potentially develop 
successful essays (e.g., they might persuade their readers to take a 
particular side on an argument); however, the two essays would be 
composed of vastly different writing styles. 

Here, we define writing flexibility as an individual’s ability to 
adapt specific components of their writing in order to craft more 
effective text. Our argument is that quality texts should not be 
assessed using a one-size-fits-all formula; rather, successful text 
communication will depend on a large number of contextual 
factors, such as the prior knowledge and motivations of the writer 
and the audience, as well as specific characteristics of the assign- 
ment. Importantly, these characteristics of the writing task interact 
with each other to impact the demands of a particular writing 
assignment. Thus, writers must assess each writing task on an 
individual basis to determine the most appropriate strategies and 
approaches for completing an assignment. In this vein, we argue 
that more proficient writers will exhibit flexibility in their writing 
styles across different writing assignments. Our proposal in this 
article is that we can measure linguistic flexibility (i.e., the degree 
to which individuals vary their linguistic style across multiple 
essays) to serve as a proxy for this broader notion of writing 
flexibility. 


Current Study 


The goal of the current study is to test the hypothesis that better 
writing is associated with increased flexibility of writing style, 
rather than only a set of static linguistic characteristics. This 
concept of “flexible” writers is in direct contrast to writers who use 
a fixed set of linguistic properties within the majority of their 
essays—in other words, they are inflexible. There have been mixed 
empirical findings regarding the relationship between text narra- 
tivity (and readability, more broadly) and essay quality. In this 
study, we suggest that this may result, in part, from the various 
demands of the writing assignment. In other words, different 
writing prompts and assignments may call on different skills and 
knowledge sources, which can differentially affect the writing 
strategies and processes engaged by individuals. Thus, we addi- 
tionally suggest that this flexibility in writing style may result as a 
function of individual differences related to literacy skills, such as 
vocabulary knowledge, comprehension ability, and prior world 
knowledge. Our primary research questions are: 


1. How is writing proficiency related to students’ flexible 
use of narrativity? 


2. How does this flexible use of narrativity vary as a func- 
tion of individual differences among students? 


We first hypothesize that greater writing proficiency will be 
positively associated with students’ linguistic flexibility across the 
essays. In particular, we hypothesize that students who vary in 
their use of narrative language across multiple essays will also 
produce essays that are rated as higher quality texts. 

Second, we hypothesize that this measure of narrative flexibility 
will vary as a function of individual differences among the stu- 
dents. This hypothesis follows from the assumption that writing 
flexibility is a strategic behavior that relates to students’ literacy 
abilities and prior knowledge of a given topic. Thus, students who 
have developed strong literacy skills will be more likely to assess 
when it is appropriate to employ specific linguistic and rhetorical 
devices within individual writing assignments. 

This study combines both natural language processing and dy- 
namical techniques to characterize the degree to which students 
vary in their use of narrativity across 16 timed, argumentative, 
prompt-based essays. Thus, writing flexibility is measured here in 
a very specific context. We chose to specifically focus on the 
narrativity within the essays because of the previously mixed 
empirical findings regarding the construct of narrativity in text 
quality. Crossley and colleagues (2014), for instance, found that 
one profile of high-quality writing related to a more narrative, 
story-like style, whereas a separate profile of essays (of equally 
high quality) was related to more informative, academic text. 
Thus, an important research question is whether more proficient 
writers are able to leverage the benefits of both narrative and 
informative styles according to the demands of specific writing 
assignments. For instance, one skilled writer might have little 
fact-based domain knowledge with which to develop evidence on 
a particular prompt. Therefore, this writer might construct an essay 
that relies on personal anecdotes and descriptions that are engaging 
to the reader. On the other hand, another skilled writer might rely 
more heavily on fact-based evidence to answer the prompt. In this 
essay, the writer would use facts to argue a particular perspective 
on the prompt question. In both scenarios, the resulting essays are 
high quality and successfully able to argue a particular point to the 
reader. However, the two writers simply used different strategies 
to achieve this goal. 

An additional note is that this study solely focuses on timed, 
prompt-based essays. Although we argue that this investigation of 
narrativity is important across a number of different writing 
genres, we chose to focus our initial analysis on this genre because 
these essays do not require prior content knowledge of a particular 
domain. This allows us to more easily tease apart our results in 
terms of their relationship to writing proficiency, rather than 
greater knowledge of a particular domain. 


Methods of Automated Text Analysis 


To address our research questions, we use a combination of 
natural language processing and dynamic methodologies to exam- 
ine students’ use of narrativity across multiple argumentative 
essays. Text narrativity is a key component of text readability; 
therefore, it provides a strong foundation on which to build an 
understanding of the relations between text readability and essay 
quality. In this study, we chose to leverage automated text analysis 
tools to provide a measure of text narrativity. Automated indices 
provide a quick and reliable alternative to the subjective coding of 
essays by humans. 

Automated measures of text readability and narrativity. In 
the current study, we employed Coh-Metrix (McNamara & 
Graesser, 2012; McNamara, Graesser, McCarthy, & Cai, 2014) to 
automatically assess the degree to which students’ essays were 
more narrative or informative. The principal method for automat- 
ically measuring text difficulty is the use of standardized “read- 
ability” formulas (Hiebert, 2002). These formulas provide a single 
metric by which the relative syntactic and semantic difficulty of 
texts can be compared. One of the most common readability 
formulas is the Flesch-Kincaid Grade Level (FKGL; Kincaid, 
Fishburne, Rogers, & Chissom, 1975), which calculates word and 
sentence length to determine text difficulty. This score is a single 
index that maps onto the grade levels in the U.S. school system. 
Unidimensional measures, such as FKGL, can simplify the text 
assignment process by providing teachers a single metric to select 
grade-appropriate texts for their students. 
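
To make the unidimensional nature of such formulas concrete, the following minimal sketch computes the Flesch-Kincaid Grade Level from raw counts. The coefficients are those of the standard published formula; the function name and the choice to accept pre-computed word, sentence, and syllable counts (rather than deriving them from raw text) are assumptions made for illustration only.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level from pre-computed counts.

    Hypothetical helper for illustration; counting syllables from raw
    text is deliberately left out because it requires a heuristic.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences   # sentence length
    syllables_per_word = total_syllables / total_words   # word length
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Illustrative counts only (not taken from any essay in the study).
example_grade = flesch_kincaid_grade(
    total_words=100, total_sentences=5, total_syllables=150
)
```

Because the result depends only on average sentence length and average word length, two texts with very different cohesion or narrative structure can receive the same grade level.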

Despite their simplicity, traditional readability formulas lack the 
sophistication needed to represent the multiple levels of text dif- 
ficulty. One problem is that these formulas typically measure the 
surface-level characteristics of texts, which are solely predictive of 
students’ superficial text comprehension (i.e., their understanding 
of the individual words and sentences; Davison, 1984). Most 
contemporary models of reading comprehension suggest that there 
are multiple levels of understanding that contribute to the compre- 
hension process (Graesser & McNamara, 2011). However, stan- 
dard readability formulas often fail to identify the text character- 
istics that impact students’ understanding at deep levels (e.g., deep 
cohesion). Further, they provide teachers little guidance on how to 
diagnose and remediate students’ difficulties. In particular, they 
give no information on which text properties may be challenging 
or helpful to individual students. 

Coh-Metrix (McNamara & Graesser, 2012; McNamara et al., 
2014) is a computational text analysis tool that was developed, in 
part, to provide stronger measures of text difficulty (Duran, Bel- 
lissens, Taylor, & McNamara, 2007). This tool analyzes texts at 
the word, sentence, and discourse levels; thus, it can potentially 
offer more information about the specific challenges and linguistic 
scaffolds contained in a given text. Previous work with Coh- 
Metrix suggests that multiple dimensions coordinate within texts 
to affect subsequent comprehension performance (McNamara, 
Graesser, & Louwerse, 2012). To account for these multiple text 
dimensions, Graesser and colleagues (2011) developed the Coh- 
Metrix Easability Components. These components offer a detailed 
glance at the primary levels of text difficulty and are well aligned 
with an existing multilevel framework (Graesser & McNamara, 
2011). 

Narrativity. The degree of narrativity versus informational 
content provided within an essay is assessed using the narrativity 
component score provided by Coh-Metrix (Graesser et al., 2011; 
McNamara, 2013). The narrativity of a text reflects the degree to 
which a story is being told, using characters, places, events, and 
other elements that are familiar to readers. This measure is highly 
related to the use of familiar words, greater world knowledge, and 
oral language style. Combining many narrative elements within a 
text can be used to sustain readers’ attention by creating uncer- 
tainty, excitement, or suspense (Barab, Gresalfi, Dodge, 
& Ingram-Goble, 2010; Cheong & Young, 2006; Vorderer, Wulff, 
& Friedrichsen, 1996). Additionally, narrativity allows readers to 
connect and comprehend action sequences, making it easier to 
keep track of main characters, plot points, and cause-and-effect 
relationships (Bruner, 1986; Schank & Abelson, 1995). 

Because of their engaging and familiar properties, highly nar- 
rative texts are considerably easier to read, comprehend, and recall 
than informative texts (Graesser & McNamara, 2011; Haberlandt 
& Graesser, 1985). Within the context of essay writing, however, 
the role of narrativity is less clear. Persuasive essays written with 
lower degrees of narrativity are typically, although not consistently, 
rated as having higher quality (as judged by expert human raters who 
use standardized rubrics) than more narrative essays. They also 
include more content words (e.g., nouns) and discuss more unfa- 
miliar topics. The use of facts and data as evidence in an essay (as 
opposed to, e.g., personal anecdotes) is associated with more 
refined rhetorical strategies on the part of the writer, which may 
serve to explain negative correlations between narrativity and 
essay scores. 

The narrativity component score is calculated in Coh-Metrix 
based on the results of a previous, large-scale corpus analysis 
(Graesser et al., 2011). In that study, the Touchstone Applied 
Science Associates (TASA) corpus was used to provide a repre- 
sentative sample of the types of texts that are commonly seen from 
kindergarten through 12th grade. This corpus consists of 37,520 
texts (average of 288.6 words per text, SD = 25.4) that have been 
classified according to genre and assigned an appropriate grade 
level. To develop the narrativity score (and the other Easability 
components), Graesser and colleagues (2011) first used Coh- 
Metrix to analyze the linguistic characteristics of the texts in the 
TASA corpus (53 measures were used; see Graesser et al., 2011, 
for more specific information about these indices). These indices 
ranged from basic word level information (e.g., word frequency) to 
higher level information about semantic text cohesion. A principal 
component analysis (PCA) was conducted to reduce these indices 
to a smaller number of dimensions. The Coh-Metrix measures 
converged on the PCA with eight principal component scores, 
accounting for 67.3% of the variability among the texts. 

The narrativity Easability Component score consists of 17 Coh- 
Metrix indices, with loadings ranging from 0.53 to 0.92. These 
indices provide critical information about the differences between 
narrative and informational texts. First, narrative texts include 
more descriptions of actions and events; thus, the narrativity Eas- 
ability Component assigns its scores (in part) based on the notion 
that more narrative texts contain more main verbs, adverbs, and 
intentional events, actions, and particles. Informational texts, on 
the other hand, are characterized by more unfamiliar content 
words, often in the form of nouns. An additional characteristic of 
narrative texts is that they share many characteristics of oral 
language (Biber, 1988), as evidenced by the increased frequency 
of familiar words and pronouns in the narrativity Easability Com- 
ponent, as well as the use of simpler sentence constructions. 

The resulting narrativity Easability Component score is calcu- 
lated in the form of a percentile score (ranging from 0% to 100%), 
with higher scores indicating that the text is more narrative than 
informative (and likely easier to read) than other texts in the TASA 
corpus. For instance, a percentile score of 85% means that 85% of 
the texts in the TASA corpus are likely more difficult than the 
particular text (at least in terms of its narrativity), and 15% are 
likely easier to read. Overall, the Coh-Metrix narrativity Easability 
Component score can serve as a measure of text readability, 
specifically regarding the degree of story-like elements that are 
present within an individual text. 
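
The percentile interpretation can be illustrated with a generic percentile-rank calculation. This is only a sketch of the general idea: how Coh-Metrix itself converts component scores to percentiles is not described here, and the function name, the reference scores, and the "share of reference texts scoring lower" definition are all assumptions made for this example.

```python
from bisect import bisect_left


def percentile_rank(reference_scores, score):
    """Percentage of reference texts whose score falls below `score`.

    Hypothetical helper: a plain empirical percentile rank, not the
    Coh-Metrix procedure.
    """
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    below = bisect_left(ordered, score)  # count of scores strictly below
    return 100.0 * below / len(ordered)


# Made-up narrativity component scores standing in for a reference corpus.
reference = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5, 2.1, 2.8]
essay_percentile = percentile_rank(reference, 1.0)
```

Under this definition, a higher percentile means the essay is more narrative than a larger share of the reference texts.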


Dynamic Analyses 


In the current study, we use dynamic systems theory and its 
associated analysis techniques to analyze the flexible relations 
between the narrative properties of essays and students’ writing 
proficiency. Dynamic methodologies offer researchers a means 
with which they can characterize patterns that emerge from stu- 
dents’ behaviors or interactions (e.g., writing, dialect, or choices) 
during a learning task. Unlike more traditional statistical measures, 
dynamic methodologies place a strong emphasis on the role of 
time in the assessment of behavioral patterns and change. In other 
words, dynamic analyses focus on the individual fluctuations that 
occur across time, as opposed to treating behavior as a static (i.e., 
inflexible) process, as is customary in many traditional statistical 
approaches (i.e., self-reports). Dynamic methodologies can, there- 
fore, help to contextualize students’ behaviors and offer educators 
and researchers a means of capturing important fine-grained pat- 
terns across time. 

Although the current study is one of the first to use dynamic 
analyses to assess writing flexibility, these techniques have previ- 
ously been used across a wide variety of domains as a means to 
understand the complex patterns that manifest in individuals’ 
behaviors over time (Snow, Allen, Russell, & McNamara, 2014; 
Snow, Likens, Jackson, & McNamara, 2013; Soller & Lesgold, 
2003; Zhou, 2013). Here, we utilize two dynamic methodologies, 
random walks and Euclidian distances, to visualize and classify 
the extent to which students demonstrate a flexible use of narrative 
properties across time. Random walks are mathematical tools that 
are used to visualize fine-grained patterns that emerge in categor- 
ical data over time (Nelson & Plosser, 1982; Snow et al., 2013). 
Researchers have used this technique in a variety of domains, such 
as psychology (Allen, Snow, & McNamara, 2014; Collins & De 
Luca, 1993), genetics (Lobry, 1996), ecology (Benhamou & 
Bovet, 1989), and the learning sciences (Snow et al., 2013). For 
example, geneticists have utilized random walk analyses to inves- 
tigate how patterns of disease form within gene sequences (Ar- 
neodo et al., 1995; Lobry, 1996), and learning scientists have used 
this methodology to visualize how students’ choice patterns within 
computer-based learning environments vary as a function of their 
prior skills (Snow et al., 2013). 

In order to validate the visualizations offered by these random 
walk analyses, researchers need to quantify these fine-grained 
patterns of behavior. Euclidian distance analyses offer a metric that 
is embedded within the random walks that can quantify students’ 
fluctuations as they unfold over time (Allen, Snow, & McNamara, 
2014). In this calculation, Euclidian distances for each “step” or 
movement within a random walk analysis are used to create a 
distance time series. This time series serves as a quantification for 
the movements in the categorical patterns visually represented in 
the random walk. 


Method 


Participants 


The data presented here were collected as part of a larger study 
(n = 86), which compared the Writing Pal intelligent tutoring 
system (ITS) to an Automated Writing Evaluation (AWE) system 
(Allen, Crossley et al., 2015; Allen, Crossley, Snow, & McNa- 
mara, 2014; Crossley, Varner, Roscoe, et al., 2013; Roscoe & Mc- 
Namara, 2013). In this study, we focus on the participants who 
engaged with the AWE system (n = 45). All participants were 
high school students recruited from an urban environment located 
in the southwestern United States. These students were, on aver- 
age, 16.4 years of age, with a mean reported grade level of 10.5. 

Of the 45 students, 66.7% were female and 31.1% were male. 
Students’ self-reported ethnicity breakdown was as follows: 62.2%
were Hispanic, 13.3% were Asian, 6.7% were Caucasian, 6.7% were
African American, and 11.1% reported “other.” All students were
recruited through local high schools and publicly posted flyers. These
students received $10.00 for their participation in each session of this 
experiment. Additionally, the students’ money was doubled for com- 
pleting all 10 of the sessions. Thus, the participants in this study each 
received $200 for their participation. 


Study Procedure 


The current study was a 10-session experiment that lasted ap- 
proximately three weeks. During the first session, students com- 
pleted a pretest that contained measures of writing ability, prior 
knowledge, reading ability, and literacy skills. Training occurred 
during the following eight sessions, in which students engaged 
with the AWE system. During Session 10, students completed a 
posttest, which contained measures similar to the pretest. Previous 
analyses have indicated that students increased their essay quality,
motivation, perceptions of improvement, and self-assessment ac- 
curacy across the training sessions (for more thorough information 
on the results of the training study, see Allen, Crossley, et al., 
2015). 

Pretest. During Session 1, students completed a pretest that 
lasted approximately one hour in duration and contained a battery 
of individual difference measures. These measures included de- 
mographics, prior knowledge test, writing proficiency (25-min 
SAT-style essay), and literacy skills. 

Training. During training (Sessions 2 to 9), students practiced 
writing 25-min timed essays on SAT-style prompts. During each 
of the eight training sessions students wrote and revised two timed 
essays (i.e., 16 essays). Upon completion of each essay, the AWE 
system provided students with automated formative feedback. 
After students examined this feedback they were allotted 10 min to 
revise their essay based on the feedback presented. 

Posttest. During Session 10, all participants completed a post- 
test. The posttest comprised measures similar to the pretest, in- 
cluding a writing proficiency test (25-min SAT-style essay). 


Materials and Measures 


Prior reading ability. Students’ reading ability was assessed 
using the Gates-MacGinitie reading skill test (4th ed.; MacGinitie 
& MacGinitie, 1989). This 48-item multiple-choice test assessed 
students’ reading comprehension ability by asking students to read 
short passages and then answer two to six questions about the 
content of the passage. These questions were designed to measure 
both shallow- and deep-level comprehension. All students were 
given standard instructions, which included two practice questions. 
This test was a timed task that gave every student 20 min to answer 
as many questions as possible. The Gates-MacGinitie Reading 
Test is a well-established measure of student reading comprehen- 
sion, which provides information about students’ literacy abilities 
(α = .85-.92; Phillips, Norris, Osmond, & Maynard, 2002).

Vocabulary knowledge. Students’ vocabulary knowledge 
was assessed using the Gates-MacGinitie vocabulary test (4th ed.; 
MacGinitie & MacGinitie, 1989; see previous section for reliabil- 
ity). This test includes 45 simple sentences, each with an under- 
lined vocabulary word. Students are asked to read the sentence and 
choose the word most closely related to the underlined word within 
the sentence from a list of five choices. All students were given
standard instructions, which included two practice questions. This 
test was a timed task that gave every student 10 min to answer as 
many questions as possible. 

Prior knowledge. Students’ prior science knowledge was as- 
sessed using a 30-item measure of prior knowledge designed for 
use with high school students. This task has been used previously 
in work related to reading comprehension and strategy skill acqui- 
sition (Roscoe, Crossley, Snow, Varner, & McNamara, 2014). The 
30-item multiple-choice measure assesses students’ knowledge in 
the areas of science, literature, and history. The test shows high 
reliability, with α ranging from .72 to .81. The measure is a
modified version of a knowledge assessment used in several stud- 
ies and validated with over 4,000 high school and college students 
(McNamara, O’Reilly, Best, & Ozuru, 2006; O’Reilly, Best, & 
McNamara, 2004; O’Reilly & McNamara, 2007; O’Reilly, Taylor, 
& McNamara, 2006). This version of the assessment was developed
in prior work by including items with moderate difficulty
(i.e., 30%-60% of students could answer correctly) that were
correlated with individual difference measures (e.g., reading skill) 
and performance on comprehension tests. Additional items were 
obtained from high school textbooks. In this process, 55 multiple- 
choice questions (i.e., 18 science, 18 history, and 19 literature) 
were piloted with 15 undergraduates to test item performance. 
Thirty questions (10 per domain) were selected such that no items 
selected exhibited either a ceiling (>.90) or floor effect (<.25, 
chance level). Examples are provided in Table 1. 
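The item-screening rule described above can be illustrated with a minimal sketch. It assumes that each piloted item is summarized by its proportion of correct responses; the function name `screen_items`, the item labels, and the pilot values are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def screen_items(p_correct: Dict[str, float],
                 floor: float = 0.25,
                 ceiling: float = 0.90) -> List[str]:
    """Keep items that show neither a floor (< .25) nor a ceiling (> .90) effect."""
    return [item for item, p in p_correct.items() if floor <= p <= ceiling]


# Hypothetical pilot results: proportion of pilot participants answering correctly.
pilot = {"sci_01": 0.95, "sci_02": 0.60, "hist_01": 0.20, "lit_01": 0.45}
kept = screen_items(pilot)
print(kept)
```

In this illustration, the first item is dropped for a ceiling effect and the third for a floor effect; selecting 10 retained items per domain would be a separate step.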

Pretest and posttest essay quality. Students’ writing proficiency
was assessed at both pretest and posttest through the use of
timed (25-min) and counterbalanced SAT-style essays (the two 
essay prompts can be found in the Appendix). The pretest and 
posttest essays were assessed on a 6-point scale by two indepen- 
dent expert human raters. These raters had previous experience 
scoring academic essays and were compensated for their time. 
Additionally, they were college composition instructors with at 
least three years of experience teaching writing. The holistic rating 
scale was developed in order to assess the quality of each essay on 
a scale from 1 to 6.¹ The raters were given specific instruction on
this rubric and given example essays for each score in the rubric
(e.g., they were given an example of an essay that had received a
score of “1” and another essay that had received a score of “2”).
Additionally, they were told that the distance between each score 
was equal (i.e., a score of 5 is as far above a score of 4 as a score 
of 3 is above a score of 2). After receiving instruction on the 
rubric, the raters practiced using the rubric on a sample set of
SAT-style essays written on the same prompts as the essays in the
current study. The raters were expected to continue with practice 
until their interrater reliability reached a correlation of r = .70. 
After the raters had reached an interrater reliability of r = .70, each 
rater then evaluated the entire set of essays. Thus, each essay 
received two essay scores. Once these ratings were collected, 
differences between the raters’ scores were calculated. All score 
differences between the raters were less than 2 (i.e., the raters 
demonstrated 100% adjacent agreement with the final set). Thus, 
holistic scores for pretest and posttest essays were calculated by 
averaging the scores between raters. For the final set, the raters 
demonstrated a 57% exact accuracy and a 100% adjacent accuracy. 
Additionally, the raters’ final essay scores were significantly cor- 
related, r = .55, p < .001. 
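The agreement statistics reported above can be sketched as follows. This is a minimal illustration, assuming that "adjacent" agreement means the two ratings differ by no more than one point; the function name `rater_agreement` and the ratings are hypothetical and are not the study data.

```python
from typing import List, Tuple


def rater_agreement(rater_a: List[int], rater_b: List[int]) -> Tuple[float, float]:
    """Return (exact, adjacent) agreement proportions for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of essays")
    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n
    return exact, adjacent


# Hypothetical ratings on the 6-point holistic scale.
rater_a = [3, 2, 4, 3, 2, 3, 4]
rater_b = [3, 3, 4, 2, 2, 4, 4]

exact, adjacent = rater_agreement(rater_a, rater_b)
# Holistic score for each essay: the average of the two raters' scores.
holistic = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
print(exact, adjacent, holistic)
```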

Training essay performance. Training performance in this 
study was defined as students’ average essay score across the 16 
essays that were composed in the AWE system. All of the essays 
that students wrote in this AWE system were timed, SAT-style 
essays, with prompts that were similar to those given at pretest and 
posttest (for a list of the prompt topics and the order they were 
assigned, see Table 2). To score these essays, we used a previously 
developed algorithm to assign holistic writing scores to these 
written essays. The algorithm uses variables from Coh-Metrix, the
Writing Assessment Tool, and Linguistic Inquiry and Word Count
(Pennebaker, Booth, & Francis, 2007) to assign essay scores on a
scale from 1 to 6. These indices range from word-level properties
of the essays, such as the number of infinitives, to higher level 


¹ For a copy of the SAT rubric, see http://sat.collegeboard.org/scores/sat-essay-scoring-guide.
Table 1
Examples of Questions and Answers in Prior Knowledge Assessment

Domain        Question and answer choices
Science       The poisons produced by some bacteria are called . . . (a) antibiotics, (b) toxins, (c) pathogens, (d) oncogenes
History       A painter who was also knowledgeable about mathematics, geology, music, and engineering was . . . (a) Michelangelo, (b) Cellini, (c) Titian, (d) da Vinci
Literature    Which of the following is the setting used in The Great Gatsby . . . (a) New York, (b) Boston, (c) New Orleans, (d) Paris

properties, such as the semantic similarity of the paragraphs within 
the essay. The algorithm was developed using correlation and 
discriminant function analyses to categorize 1,243 student essays
that had been previously scored by expert human raters. The 
resulting models reported exact matches between the human scores 
and the predicted essay scores with 55% accuracy. Additionally, 
the models reported 92% accuracy for adjacent matches (see 
McNamara et al., 2015, for a more thorough description of the 
algorithm used in this study). 

Assessment of narrative flexibility. We used random walk 
analyses to investigate the flexibility of students’ use of narrativity 
across time. Random walk analyses are mathematical tools that are 
used to provide visual representations of patterns in categorical 
data as they manifest across time (Benhamou & Bovet, 1989; 
Lobry, 1996; Nelson & Plosser, 1982; Snow et al., 2013). In the 
current study, we first used Coh-Metrix to compute a narrativity 
percentile score (range from 0 to 100) for each essay. We then 
used this narrativity percentile score to classify each essay into
one of four orthogonal categories (see Table 3). This classification was
organized based on the degree of narrativity present in each essay
(using the percentile score provided by Coh-Metrix). Each orthogonal
category was then assigned to a vector that fell along a basic
scatterplot. Therefore, if an essay received a narrativity score
below 25%, this essay was assigned to the vector (-1, 0), whereas
an essay that received a score that was greater than 75% narrative 


Table 2 


was assigned to the vector (0, -1). Once each essay had been
assigned to a vector, we calculated a random walk for each student 
that began at the origin of the scatterplot (0, 0). For each subse- 
quent essay that a student wrote, the walk would “step” in the 
direction that was consistent with the assigned vector. The result- 
ing walk would represent each student’s use of narrativity across 
the 16 training essays. 

Figure 1 provides an example of what a random walk might look 
like for a student who wrote four training essays. All walk se- 
quences begin at the origin of the scatterplot (see #0 in Figure 1). 
The first essay written by the student was low in narrativity (i.e., 
narrativity percentile score <25%); thus, the walk takes a step left 
along the x-axis (see #1 in Figure 1). The second essay written by 
the student received a narrativity percentile score between 25% 
and 50%; this means that the walk takes a step up along the y-axis
(see #2 in Figure 1). The student wrote a third essay that had a 
narrativity percentile score between 50% and 75% narrativity. 
Therefore, the walk takes a step to the right along the x-axis (see 
#3 in Figure 1). The fourth and final essay written by the student 
received a narrativity percentile score between 25% and 50%, 
which again makes the walk step up along the y-axis (see #4 in 
Figure 1). These rules were used to generate a unique random walk 
for each of the 45 students, which represented the fluctuations in 
their use of narrativity across the 16 essays that were written in the 
AWE system. 
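The walk-generation rules described above can be sketched in a few lines of code. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the treatment of scores falling exactly on a category boundary (25, 50, or 75) is an assumption because the text does not specify it, and the four percentile values are invented so that the walk retraces the steps described for Figure 1.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def narrativity_to_step(percentile: float) -> Point:
    """Map a narrativity percentile (0-100) to a unit step, following Table 3."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile < 25:
        return (-1, 0)   # less than 25%: move left on the x-axis
    if percentile < 50:
        return (0, 1)    # between 25% and 50%: move up on the y-axis
    if percentile <= 75:
        return (1, 0)    # between 50% and 75%: move right on the x-axis
    return (0, -1)       # greater than 75%: move down on the y-axis


def random_walk(percentiles: List[float]) -> List[Point]:
    """Return every position visited, starting at the origin (0, 0)."""
    x, y = 0, 0
    path = [(x, y)]
    for percentile in percentiles:
        dx, dy = narrativity_to_step(percentile)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# Hypothetical percentiles for four essays: low, 25-50%, 50-75%, 25-50%.
figure1_path = random_walk([10, 40, 60, 30])
print(figure1_path)  # [(0, 0), (-1, 0), (-1, 1), (0, 1), (0, 2)]
```

Applied to a student's 16 training essays, the same function would return 17 positions (the origin plus one position per essay).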

Figures 2 and 3 illustrate two random walks that were generated 
using two students’ actual training essays from the current study. 
These walks represent students’ degree of “narrative flexibility” 
across the training essays. 

Figure 2 illustrates the walk of a student who wrote highly 
narrative (narrativity percentile score above 75%) essays across each
of the training essay assignments. In other words, regardless of the 
writing prompt, this student employed the same range of narrativ- 
ity throughout all of her essays. On the other hand, the walk 
depicted in Figure 3 comes from a student who was highly flexible 
in the use of narrativity across the 16 essays. As various factors
changed from essay to essay (e.g., the essay prompt), this student
employed varying degrees of narrativity to develop arguments and 
ideas. 


Table 2
Writing Pal Essay Prompt Order

Session      Essay prompts
Session 2    Planning: Does every individual have an obligation to think seriously about important matters?
             Originality: Can people ever be truly original?
Session 3    Winning: Do people place too much emphasis on winning?
             Loyalty: Should people always maintain their loyalties, or is it sometimes necessary to switch sides?
Session 4    Patience: Is it better for people to act quickly and expect quick responses from others rather than to wait patiently for what they want?
             Memories: Do personal memories hinder or help people in their effort to learn from their past and succeed in the present?
Session 5    Heroes: Should we admire heroes but not celebrities?
             Choices: Does having a large number of options to choose from increase or decrease satisfaction with the choices people make?
Session 6    Perfection: Do people put too much importance on getting every detail right on a project or task?
             Optimism: Is it better for people to be realistic or optimistic?
Session 7    Uniformity: Is it more valuable for people to fit in than to be unique and different?
             Problems: Should individuals or the government be responsible for solving problems that affect our communities and the nation in general?
Session 8    Beliefs: Are widely held views often wrong, or are such views more likely to be correct?
             Happiness: Are people more likely to be happy if they focus on their personal goals or on the happiness of others?
Session 9    Fame: Are people motivated to achieve by personal satisfaction rather than by money or fame?
             Honesty: Do circumstances determine whether or not we should tell the truth?

Table 3
Narrativity Classification and Vector Assignment

Essay narrativity level            Axis direction assignment
Less than 25% narrativity          -1 on x-axis (move left)
Between 25% and 50% narrativity    +1 on y-axis (move up)
Between 50% and 75% narrativity    +1 on x-axis (move right)
Greater than 75% narrativity       -1 on y-axis (move down)



Euclidian distance measure. The random walks described in 
the Assessment of Narrative Flexibility section provide visualiza- 
tions of the fluctuations in students’ use of narrativity across time. 
To quantify these changes in students’ writing patterns, distance 
time series were calculated for each student using Euclidian dis- 
tance measures. This measure calculated the distances between the 
origin of the scatterplot (0, 0) and each step in the walk (see 
Equation 1). In this equation, y represents the current position of
the particle (the end point of the walk) on the y-axis, x represents
the particle’s position on the x-axis, and i represents the ith “step”
in the walk.

$$\text{Distance} = \sqrt{(y_i - y_0)^2 + (x_i - x_0)^2} \qquad (1)$$


After calculating the Euclidian distance of the steps in each 
walk, an average Euclidian distance score was calculated for each 
student’s entire walk. Broadly, this measures how far each student 
“walked” from the origin of the scatterplot across the 16 essays. 
This resulting distance measure (i.e., a student’s narrative distance 
score) was used to represent students’ flexibility in their use of 
narrativity. If a student, for example, employed the same degree of 
narrativity across all 16 training essays, that student would travel 
further from the origin, resulting in a high narrative distance
score (see Figure 2 for a visualization of this type of student). 
Conversely, if a student varied considerably in the use of narra- 
tivity across the essays, the resulting narrative distance score 
would be lower, as the fluctuations would cause the walk to remain 
closer to the origin (see Figure 3 for a visualization of this type of 


Between 25% and 50% 
Narrativity 







Between 50% and 75% 
Narrativity 


Less than 25% Narrativity 





Greater than 75% Narrativity 


Figure 1. This is an example of a random walk as described in the text. 


Between 25% and 50% 
Narrativity 





Between 50% and 75% 
Narrativity 


Less than 25% Narrativity 


Greater than 75% Narrativity 


Figure 2. This is an example of a random walk for an inflexible writer. 


student). Overall, students’ distance scores provide information 
about whether they are varied in their writing style (i.e., lower 
distance scores and more flexible) or whether they tend to remain 
inflexible (i.e., consistent) across multiple essays (i.e., higher 
distance scores and inflexible). It is important to note that the 
directionality of students’ random walks does not matter, as the 
Euclidian distance measure captures how far (in any direction) 
students’ walks move away from the center point. 
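The distance calculation in Equation 1 and the resulting narrative distance score can be sketched as follows. This is a minimal illustration under one reading of the text, namely that the score is the mean of the distances from the origin computed at every step of the walk; the function names and the two example walks are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]


def build_path(steps: List[Point]) -> List[Point]:
    """Accumulate unit steps into the positions visited, starting at (0, 0)."""
    x, y = 0, 0
    path = [(x, y)]
    for dx, dy in steps:
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def distance_series(path: List[Point]) -> List[float]:
    """Euclidian distance from the origin of the walk at each step (Equation 1)."""
    x0, y0 = path[0]
    return [math.sqrt((y - y0) ** 2 + (x - x0) ** 2) for x, y in path[1:]]


def narrative_distance_score(path: List[Point]) -> float:
    """Mean of the distance time series for one student's walk."""
    series = distance_series(path)
    return sum(series) / len(series)


# Hypothetical inflexible writer: all 16 essays above 75% narrativity (always down).
inflexible = build_path([(0, -1)] * 16)
# Hypothetical flexible writer: cycles through all four categories four times.
flexible = build_path([(-1, 0), (0, 1), (1, 0), (0, -1)] * 4)

print(narrative_distance_score(inflexible))  # 8.5
print(narrative_distance_score(flexible))
```

Under this reading, a writer who stays in a single category for all 16 essays obtains a score of 8.5, which coincides with the maximum narrative distance score reported in the Results, whereas the cycling writer stays close to the origin and scores below 1.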

The random walk and Euclidian distance analyses used in the 
current study afford researchers the ability to capture flexibility that 
would otherwise be missed by traditional (i.e., static) metrics. In 
particular, random walk analyses capture movements as they take 
place across time. In this sense, we can analogize the narrative 
flexibility examined in this study to the dancing of the waltz. In the 
waltz, dancers make multiple movements that result in rotations of 
the dancers around the floor. Importantly, in the waltz, skilled 
dancers do not travel across the room in a straight line. Although this
would result in more efficient travel, these dancers recognize that in
order to perform the dance in the most graceful way, they must make
small rotations that result in larger movements across the floor.
Additionally, they must make adjustments to their behaviors based on
their partner’s behaviors, as well as the behaviors of the other dancers
on the floor. Thus, in the waltz, the fine-grained steps and patterns of
the dancers are important to its overall aesthetics and success. Similarly,
we propose that skilled writers will demonstrate more flexible
patterns of narrativity across their essays. Thus, rather than consistently
producing essays of the same style, these writers will flexibly
adapt their behaviors to the demands of the prompt (e.g., based on
their own prior knowledge, the audience). Related to the random walk
analyses, if a student generates essays that vary in their degree of
narrativity, the student’s random walk will hover around the origin
of the x- and y-axes and contain more movements that change
directions. In contrast, a student who is less flexible and consistently
generates essays with similar levels of narrativity will demonstrate a
random walk that moves in one direction and covers a greater distance
along the x- or y-axis.

Figure 3. This is an example of a random walk for a flexible writer. (Axis labels: up, between 25% and 50% narrativity; right, between 50% and 75% narrativity; left, less than 25% narrativity; down, greater than 75% narrativity.)


Statistical Analyses 


To assess the degree to which writing quality is associated with 
students’ flexible use of narrativity, we calculated random walks, 
Euclidian distances, Pearson correlations, and regression analyses. 
The random walk analyses allowed us to visualize students’ use of 
narrativity across their 16 essays. Additionally, this random walk 
allowed us to calculate a Euclidian distance measure, which reveals 
students’ consistency in their use of narrativity across their 16 essays. 
Pearson correlations were used to assess the relation between flexi- 
bility (as defined by the Euclidian distance measure) and essay qual- 
ity, as well as individual differences in students’ prior global knowl- 
edge, prior vocabulary knowledge, and prior reading comprehension 
ability (see Table 4 for descriptive statistics on these pretest and
posttest materials). Finally, regression analyses were conducted to
follow up on the correlation analyses in order to provide an indication of
the variables that accounted for the most variability in the dependent 
variables (i.e., essay quality and flexibility). 
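As a brief illustration of the correlational step, the sketch below computes a Pearson correlation between narrative distance scores and essay scores. The function `pearson_r` is a plain textbook implementation, and the six data points are hypothetical values chosen only to show the expected direction of the relation; they are not the study data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical students: higher distance (less flexible) paired with lower scores.
distance_scores = [8.5, 7.2, 6.1, 5.0, 3.4, 2.0]
essay_scores = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]

r = pearson_r(distance_scores, essay_scores)
print(round(r, 2))
```

A negative coefficient here corresponds to the hypothesized pattern: lower distance scores (greater flexibility) accompany higher essay scores.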


Results 


Random Walks 


To visualize and categorize how students varied the narrativity 
in their writing style, random walk analyses were calculated using 


Table 4
Descriptive Statistics for Pretest and Posttest Materials

Measure                        Minimum   Maximum   Mean (SD)
Pretest essay score            2.00      4.00      2.80 (.57)
Posttest essay score           2.00      4.50      3.10 (.64)
Reading comprehensionᵃ         21.00     75.00     47.55 (17.12)
Vocabulary knowledgeᵃ          13.00     89.00     56.44 (20.20)
Prior knowledge (overall)ᵃ     27.00     77.00     51.70 (14.54)
Science prior knowledgeᵃ       20.00     90.00     52.67 (18.02)
History prior knowledgeᵃ       10.00     100.00    54.00 (22.60)
Literature prior knowledgeᵃ    10.00     70.00     48.44 (14.92)

ᵃ Score is based on percentage correct.


the rules described in the previous section (see Table 3) for each 
student. These walks produced distance measures for each student,
which are indicative of how flexible or inflexible the student’s use
of narrativity was across all 16 essays. Overall, these narrative 
distance measures suggested that students varied considerably in 
their narrative flexibility, ranging from a minimum narrative dis- 
tance score of 2.03 to a maximum narrative distance score of 8.50 
(M = 6.11, SD = 1.73). The narrative distance score for each 
student in this study is plotted in Figure 4 to provide a visualization 
of the degree to which students varied in their flexible use of
narrativity across the 16 training essays. 

This variation in narrative flexibility was examined according to 
students’ writing proficiency. To provide a coarse visualization of 
the flexibility differences between the less and more skilled writ- 
ers, we created a visualization that compared the narrative distance 
scores for two groups of students (based on a median split on 
students’ pretest essay scores): less skilled writers and more skilled 
writers. To confirm that the visualization was depicting two sep- 
arate groups of students, a between-subjects ANOVA investigated 
the difference between these less skilled and more skilled writing 
ability students’ narrative distance scores and revealed that more 
skilled writers had significantly lower narrative distance scores 
(M = 5.29, SD = 1.47) compared with less skilled writers (M =
7.02, SD = 1.60), F(1, 42) = 14.06, p = .001, d = 1.13. 
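The reported effect size can be approximately reproduced from the group means and standard deviations given above. The sketch below assumes equal group sizes (the group sizes are not stated in the text) and uses the pooled-standard-deviation form of Cohen's d; the function name `cohens_d_equal_n` is hypothetical.

```python
import math


def cohens_d_equal_n(mean_1: float, sd_1: float, mean_2: float, sd_2: float) -> float:
    """Cohen's d using a pooled SD, assuming the two groups are the same size."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Less skilled writers: M = 7.02, SD = 1.60; more skilled writers: M = 5.29, SD = 1.47.
d = cohens_d_equal_n(7.02, 1.60, 5.29, 1.47)
print(round(d, 2))  # 1.13
```

With unequal group sizes the pooled standard deviation would be weighted by each group's degrees of freedom, so the value could differ slightly.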

Figure 4 provides an illustration of these differences between 
less and more skilled writers. In this figure, less skilled writers are 
represented as black dots and more skilled writers are represented 
by light gray dots. As shown in this image, the less skilled writers 
(black dots) traveled further from the origin of the scatterplot (0, 0) 
than the more skilled writers (light gray dots), who seem to cluster 
more frequently near the origin. This visualization indicates that 
the more skilled writers were also the students who were more 
varied in their use of narrativity across the training essays (i.e., 
they hovered more around the origin), whereas the less skilled 
writers traveled much further from the origin and were less flexible 
in their use of narrativity. 


Writing Proficiency 


Although the visualization analyses provided preliminary evi- 
dence that less and more skilled writers differed in their narrative 
flexibility, this analysis was based on a median split and, therefore, 
has potential statistical weaknesses. Median splits pose problems 
to statistical validity because they create a false dichotomous 
variable from a continuous variable. Therefore, we conducted 
further analyses to provide more statistically valid tests of our 
research questions. Specifically, Pearson correlations were calcu- 
lated to further assess the validity of these analyses (i.e., to assess 
the degree to which students’ flexible use of narrativity was related 
to their writing proficiency). We calculated the correlations be- 
tween students’ narrative distance scores and their pretest and 
posttest essay scores (assessed by the expert human raters), as well 
as their average scores across the 16 training essays (assessed by 
the AWE algorithm). Results from these analyses indicated that 
narrative distance scores were significantly negatively related to 
the quality of pretest essay scores, r = -.45, p = .002, and
training essay scores, r = -.47, p = .001. Overall, these results
reveal that skilled writers were more flexible in their use of 
narrativity across the training essays (i.e., they exhibited lower 
Figure 4. Visualization of less skilled and more skilled students’ random
walk end points.


narrative distance scores). However, the relation between narrative 
flexibility and essay scores was no longer present at posttest (p = 
.08). These findings suggest that over the course of persistent 
writing practice, the relation between flexibility in writing style 
and essay quality is reduced. 

We conducted a stepwise regression analysis with the signifi- 
cant variables as predictors to determine which writing proficiency 
measures were the most predictive of narrative flexibility, as well 
as to assess the amount of variance accounted for by these assessments.
This analysis yielded a significant model, F(1, 42) = 11.66,
p = .001, R² = .22, with one variable retained in the final analysis:
training essay scores, β = -.47, t(42) = -3.41, p = .001. Results
of this analysis suggested that students’ flexible use of narrativity 
was most strongly predicted by the quality of the essays that they 
wrote across the 8 days of writing practice. Thus, students who 
consistently demonstrated strong writing proficiency were more 
flexible in their use of narrativity throughout essay writing prac- 
tice. 


Individual Differences 


To further investigate the role of narrativity flexibility in the 
writing process, we examined its relationship with individual dif- 
ferences known to relate to writing proficiency. Specifically, we 
calculated Pearson correlations and regression analyses between 
narrative distance scores and students’ pretest scores on assess- 
ments of prior world knowledge, vocabulary knowledge, and read- 
ing comprehension ability. Results of the correlation analyses 
suggested that the narrative distance scores were significantly 
related to all of the pretest measures except for prior knowledge in 
history and literature (see Table 5). These results suggest that 
narrative flexibility is related to other literacy skills and knowledge 
sources, rather than solely related to writing proficiency, as it is 
strongly associated with performance on assessments of prior 
science knowledge as well as literacy skills. 


We conducted a stepwise regression analysis with the signifi- 
cant variables as predictors to determine which individual differ- 
ence measures were the most predictive of narrative flexibility, as 
well as to assess the amount of variance accounted for by these 
assessments. This analysis yielded a significant model, F(1, 43) =
22.47, p < .001, R² = .34, with one variable retained in the final
analysis: reading comprehension, β = -.59, t(43) = -4.74, p <
.001. Results of this analysis suggested that students’ flexible use 
of narrativity was most strongly predicted by ability to read and 
comprehend texts. Thus, students who entered the writing task 
with more strategies and knowledge about how to comprehend 
texts may have had a simpler time adapting their writing styles to 
various prompts, as they were potentially more aware of the 
processes engaged by their readers, and thus more strategic in their 
actions (McNamara, 2013). 
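Because each stepwise model retained a single predictor, the reported statistics can be cross-checked against one another: with one predictor, the model F equals the squared t of that predictor, and R² equals the squared standardized coefficient. The short derivation below applies these identities to the two reported models; small discrepancies are presumably due to rounding of the published values.

```latex
% Single-predictor regression: F = t^2 and R^2 = \beta^2
\begin{align*}
\text{Writing proficiency model:}\quad
  & t^2 = (-3.41)^2 \approx 11.63 \approx F(1, 42) = 11.66, \\
  & \beta^2 = (-.47)^2 \approx .22 = R^2, \\
\text{Individual differences model:}\quad
  & t^2 = (-4.74)^2 \approx 22.47 = F(1, 43), \\
  & \beta^2 = (-.59)^2 \approx .35 \approx R^2 = .34.
\end{align*}
```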


Discussion 


Evidence from the field of writing research largely supports the 
notion that the linguistic properties of texts are generally indicative 
of the holistic quality of those texts. Indeed, results from a number 
of studies have pointed toward specific characteristics that predict 
human judgments of writing quality (Crossley, Roscoe, & Mc- 
Namara, 2013; McNamara et al., 2010; Witte & Faigley, 1981). 
The accuracy of these results, however, often varies along with 
various factors associated with the writing assignment, such as the 
individual rater or the writing prompt (Crossley et al., 2014; 
Crossley, Varner, et al., 2013; Varner et al., 2013). In this study, 
we empirically examined these assumptions through a computa- 
tional linguistic analysis of students’ essays. We leveraged both 
natural language processing and dynamic methodologies to cap- 
ture variability in students’ use of narrative style and to relate that 
variability to individual differences in writing proficiency, as well 
as prior science knowledge and reading comprehension skills. 

The results from the current study support our hypotheses that 
writing proficiency can be characterized (at least in part) by 
students’ flexibility across multiple essay prompts. Namely, stu- 
dents who are more flexible in their use of narrativity tend to 
receive higher scores on their essays, whereas less flexible writers 
tend to produce lower quality essays. Using random walk analyses, 
we were able to visualize students’ flexible or inflexible use of
narrativity across the 16 training essays. These analyses revealed 
the differential patterns exhibited by the less and more skilled 
writers, with the skilled writers remaining near the origin of the 
scatterplot and the less skilled writers straying further from the 


Table 5 
Correlations Between Narrative Distance Scores and Individual 
Difference Measures 

Individual difference measure       r 
Reading comprehension               [illegible] 
Vocabulary knowledge                [illegible] 
Prior knowledge (overall)           -.39* 
Science prior knowledge             -.44* 
History prior knowledge             [illegible] 
Literature prior knowledge          -.20 

* p < .05. ** p < [illegible]. 
origin. To quantify the findings from this random walk analysis, 
Euclidian distance measures were calculated. The resulting narra- 
tivity distance scores provided confirmatory empirical support for 
the random walk analyses. In particular, the results demonstrated 
that less skilled students tended to be more consistent (i.e., inflex- 
ible) in the degree to which they used narrative properties (i.e., 
higher narrative distance scores), whereas more skilled students 
demonstrated more flexibility in their use of narrativity across the 
16 essays (i.e., lower narrative distance scores). 
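
The logic of the random walk and distance score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each essay has already been assigned one of four narrativity levels, and the mapping of levels to step directions (`STEPS`) is hypothetical. The function names are likewise invented for the example.

```python
import math

# Hypothetical mapping from a narrativity level (1-4) to a unit step on a
# 2-D plane. The direction assigned to each level is an assumption.
STEPS = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1)}


def random_walk(levels):
    """Return the (x, y) positions visited, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for level in levels:
        dx, dy = STEPS[level]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def narrative_distance(levels):
    """Euclidean distance from the origin to the end point of the walk."""
    x, y = random_walk(levels)[-1]
    return math.hypot(x, y)


inflexible = [4] * 16          # same narrativity level on all 16 essays
flexible = [1, 2, 3, 4] * 4    # level varies evenly across the 16 essays

print(narrative_distance(inflexible))  # 16.0
print(narrative_distance(flexible))    # 0.0
```

A writer who repeats one level walks steadily away from the origin, whereas a writer who varies levels keeps returning toward it, which is why a lower distance score is read as greater flexibility.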

Importantly, the relationship between flexibility and narrativity 
was no longer apparent at posttest. Our interpretation of this result is 
that the quality of the students’ essays had substantially improved by 
the time they wrote the posttest essay, and, therefore, the individual 
differences in flexibility were no longer a factor in their posttest essay 
quality. In other words, the feedback generated by the AWE system 
was effective. Results from a previous analysis of the larger study 
(i.e., the comparison between the Writing Pal ITS condition and the 
AWE condition; Allen, Crossley, et al., 2014, 2015, under review; 
Crossley, Roscoe, et al., 2013; Roscoe & McNamara, 2013) revealed 
that students’ essay scores substantially improved across the training 
sessions (Allen, Crossley, et al., 2015). Additionally, the accuracy of 
the students’ self-assessments of essay quality (compared with the 
W-Pal algorithm) increased over time. This is important, 
because it potentially indicates that, with practice and feedback, 
students can become more aware of the quality and specific charac- 
teristics of their own writing and, therefore, produce essays that more 
effectively address the prompt question. 

Additionally, results from the current study revealed important 
information about individual differences associated with students’ 
flexible use of narrativity. In particular, flexible writers outper- 
formed the inflexible writers on more general assessments of 
literacy and prior knowledge. Reading comprehension skills were 
most strongly linked to this flexibility, accounting for 34% of the 
variance in students’ narrative distance scores. This finding sug- 
gests that students who were more skilled at comprehending texts 
and potentially more aware of readers’ strategies and cognitive 
processes (e.g., O’Reilly & McNamara, 2007) were also more 
easily able to adapt their writing style to match certain contexts. 
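
As a rough check on the size of this effect, the variance figure can be converted to a correlation magnitude. The conversion below assumes that the 34% is the squared correlation from a model with reading comprehension as the only predictor; that specification is an assumption here, not something stated in this section.

```latex
\[
R^2 = .34 \quad\Longrightarrow\quad |r| = \sqrt{.34} \approx .58
\]
```

Because higher narrative distance scores mark less flexible writers, the sign of the underlying correlation would be negative under this reading.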

The results from this study are important for writing researchers 
and educators, as they indicate that the link between textual prop- 
erties and writing quality may fluctuate according to the context of 
a given writing assignment. Accordingly, writing proficiency not 
only relates to the sophistication of the words and sentences a 
student produces in a given essay but is also intimately related to 
the writer’s ability to adapt style, narrative language, and other 
rhetorical content to individual writing assignments and different 
audiences. These results may be explained, in part, by the fact that 
narrativity tends to be an easier writing style to employ for high 
school students. Thus, when they are faced with multiple difficult 
writing assignments, they may resort to this easier writing style as 
a default. Additionally, the results of the individual difference 
analyses suggest that this flexibility is not exclusively related to 
writing proficiency; rather, high school students who are more 
skilled and knowledgeable are better able to adapt the style of their 
writing according to situational variations. 

Although this ability to flexibly adapt to various contexts has been 
anecdotally cited as an important component of writing proficiency 
(Graham & Perin, 2007), to date, little to no research has been 
conducted to empirically test this assumption. The scarcity of research 


on this topic may be related to the difficulties associated with assess- 
ing writing flexibility. First, it requires a longitudinal data set, such as 
the one presented here, wherein students are asked to compose mul- 
tiple essays over time and in response to different prompts. To our 
knowledge, other such data sets have not been reported in the litera- 
ture. Second, flexibility is a complex construct to measure. This is 
particularly true for ill-defined domains, such as writing, which rely 
on human subjectivity to render judgments about quality and style. 
Standardized writing assessments typically only measure high school 
students’ writing ability in one particular context and, therefore, 
cannot be sensitive to fluctuations in style, or in an individual’s 
adaptation to different contexts. If researchers and educators aim to 
develop assessments that can truly capture students’ writing profi- 
ciency, it is important to remain sensitive to their ability to adapt their 
style and language choices according to different assignments and 
contexts. 

The findings and methodologies presented here have important 
implications for the assessment of students’ writing proficiency. In 
particular, our study indicates that the linguistic properties that 
interact to predict writing quality may be inconsistent from assess- 
ment to assessment. Unfortunately, in their current state, standard- 
ized assessments of writing proficiency typically only collect a 
single writing sample from students. Thus, they are unable to take 
the construct of writing flexibility into account when making 
judgments about proficiency. This may constitute a critical over- 
sight. Standardized assessments of writing have a strong influence 
on students’ ability to enter college, as well as their receipt of 
scholarships and other such opportunities. This study suggests that 
standardized test developers should aim to develop more sophis- 
ticated assessments that can capture students’ writing skills across 
a number of different contexts. Additionally, in the future, the 
techniques used in the current study may be integrated into a 
number of educational environments to better assess and improve 
students’ writing skills. For instance, ITSs are computer-based 
educational environments that provide adaptive instruction and 
feedback to students based on their skills and performance. 
Writing-based ITSs might take advantage of this technique to 
provide feedback that not only looks at students’ individual essays 
but also captures their flexibility across multiple time points (Al- 
len, Jacovina, & McNamara, in press). 

Notably, the results reported here call for replications across 
different populations and skill levels of writers and different writ- 
ing genres. To our knowledge, there are currently no other data 
sets that would support replications of the current work. Thus, one 
goal of our future research will be to develop a corpus that contains 
multiple essays from different genres written by students from 
varying populations and skill levels. The achievement of this goal 
will help us to investigate a number of unanswered questions and 
concerns. Successful authors of persuasive essays, for example, may 
flexibly adapt their narrativity; however, in other genres, this flexibility 
may not be a positive writing characteristic. Future research 
will aim to answer this question as well as a number of other questions 
that currently remain unanswered. For example, is it the case that 
flexibility for all linguistic properties is positively related to essay 
quality? Or, are certain properties more consistently important across 
a number of different assignments? Further, this study points to the 
importance of feedback in promoting writing flexibility. This finding 
prompts the following questions: Can students be trained to be more 
flexible in their writing style? What is the role of feedback in the 
promotion of increased writing flexibility? Finally, what cognitive 
processes relate to students’ flexible use of writing styles? Is this 
driven by some executive component skill, or is this driven more 
broadly by students’ prior knowledge and use of strategies? Studies 
aimed at answering these (and other) questions have the potential to 
provide crucial information about the role of flexibility in students’ 
ability to produce high-quality text. 
Appendix 


Pretest and Posttest Essay Prompts 


Essay Prompt 1. You will now have 25 minutes to write an essay on 
the prompt below. 


The essay gives you an opportunity to show how effectively you 
can develop and express ideas. You should, therefore, take care to 
develop your point of view, present your ideas logically and 
clearly, and use language precisely. 

Think carefully about the issue presented in the following ex- 
cerpt and the assignment below. 

Whereas some people promote competition as the only way to 
achieve success, others emphasize the power of cooperation. In- 
tense rivalry at work or play or engaging in competition involving 
ideas or skills may indeed drive people either to avoid failure or to 
achieve important victories. In a complex world, however, coop- 
eration is much more likely to produce significant, lasting accom- 
plishments. 

Do people achieve more success by cooperation or by compe- 
tition? 

Plan and write an essay in which you develop your point of 
view on this issue. Support your position with reasoning and 
examples taken from your reading, studies, experience, or ob- 
servations. 


Essay Prompt 2. You will now have 25 minutes to write an essay on 
the prompt below. 


The essay gives you an opportunity to show how effectively you 
can develop and express ideas. You should, therefore, take care to 
develop your point of view, present your ideas logically and 
clearly, and use language precisely. 

Think carefully about the issue presented in the following ex- 
cerpt and the assignment below. 

All around us appearances are mistaken for reality. Clever adver- 
tisements create favorable impressions but say little or nothing about 
the products they promote. In stores, colorful packages are often better 
than their contents. In the media, how certain entertainers, politicians, 
and other public figures appear is sometimes considered more impor- 
tant than their abilities. All too often, what we think we see becomes 
far more important than what really is. 

Do images and impressions have a positive or negative effect on 
people? 

Plan and write an essay in which you develop your point of view 
on this issue. Support your position with reasoning and examples 
taken from your reading, studies, experience, or observations. 
Reading comprehension growth trajectories from 3rd to 7th grade were estimated for 99,919 students on 
a state reading comprehension assessment. We examined whether differences between students in general 
education (GE) and groups of students identified as exceptional learners were best characterized as stable, 
widening, or narrowing. The groups included students with disabilities (SWD) from 8 exceptionality 
groups and 2 groups of academically gifted students (AG). Initial reading comprehension achievement 
differed for all exceptionalities. Controlling for sociodemographic variables, small, but statistically 
significant differences in growth rate were observed, with SWD groups growing more rapidly and AG 
groups growing more slowly than GE students. Given that differences in growth for SWD were small 
relative to the magnitude of the initial achievement gaps, the observed pattern of growth was one of stable 
differences. There was evidence of some narrowing of the achievement gap for students identified with 
learning disabilities in reading. The findings were interpreted within the simple view of reading where 
increases in word recognition skills for SWD in the grade range examined may have accounted for their 
more rapid growth in reading comprehension relative to GE students. The findings suggest that similar 
expectations for rate of reading growth for GE students and SWD might be incorporated into growth- 
based accountability models, but they also suggest that reading comprehension growth sufficient to have 
an impact on SWD achievement gaps does not routinely occur in typical educational practice. 
The purpose of this study was to examine reading comprehension 
achievement growth and gaps across Grades 3 to 7 for students with 
disabilities (SWD) in comparison to students in general education 
(GE) and students identified as academically/intellectually gifted 
(AG). A particular focus was the developmental pattern of individual 
differences observed for reading comprehension achievement by 
exceptionality, and whether the differences between the focal groups 
and a comparison group of GE students were best characterized as 
increasing, decreasing, or remaining stable across the grade span 
examined. 

Reading comprehension is widely considered “the essence of 
reading” (Durkin, 1993) and a critical outcome of schooling (Na- 
tional Institute of Child Health & Human Development, 2000). 
SWD often encounter difficulty acquiring reading skills (Black- 
orby et al., 2005). For example, on the 2013 National Assessment 
of Educational Progress (NAEP; U.S. Department of Education, 
Institute of Education Sciences, National Center for Education 
Statistics, 2014), in comparison to students without disabilities 
(SWoD), much lower percentages of SWD reached the “profi- 
cient” level or above in reading in Grade 4 (10 vs. 38%), Grade 8 
(7 vs. 39%), or Grade 12 (8 vs. 40%). 

Historically, obtaining a comprehensive picture of SWD reading 
achievement growth and gaps across grades has been problematic 
because many of these students have been excluded from large 
scale achievement testing programs (Koretz & Hamilton, 2006; 
McDonnell, McLaughlin, & Morison, 1997); excluded from lon- 
gitudinal studies of reading growth (e.g., Huang, Moon, & Boren, 
2014); or included in longitudinal studies, but without their growth 
examined separately (e.g., Rescorla & Rosenthal, 2004). Although 
cross-sectional depictions of the achievement gap for SWD indicate 
that the gap widens across grades (e.g., Chudowsky, Chudowsky, 
& Kober, 2009; U.S. Department of Education, Institute 
of Education Sciences, National Center for Education Statistics, 
2014), such depictions are unlikely to represent the gaps that 
would be observed if students were followed longitudinally. Stu- 
dents’ entrances and exits from special education are related to 
their achievement, with lower achieving students in general education 
entering special education and higher achieving students in 
special education exiting, and this pattern affects the size of 
observed SWD achievement gaps (Schulte & Stevens, 2015; Ys- 
seldyke & Bielinski, 2002). A second issue in obtaining accurate 
information about longitudinal achievement gaps for SWD is that 
the group encompasses students who are receiving special educa- 
tion in 13 disability categories where the effect of the disability on 
reading achievement varies considerably (Blackorby et al., 2005; 
Wei, Blackorby & Schiller, 2011). Treating SWD as a single group 
may mask differences in reading achievement trajectories that 
have implications for intervention as well as for accountability 
policies (Buzick & Laitusis, 2010; Temple-Harvey & Vannest, 
2012; Wei et al., 2011). 

Although students who are AG are not at risk for low achieve- 
ment, obtaining a comprehensive picture of reading achievement 
growth for this group of students also presents challenges. Unlike 
SWD, AG students generally have been included in longitudinal 
studies of achievement growth, but only a few studies have exam- 
ined growth for these students separately (e.g., Rambo-Hernandez 
& McCoach, 2015). Advocacy groups have expressed concern that 
the current focus on grade level proficiency in the No Child Left 
Behind Act of 2001 (NCLB, 2002) may result in less focus on 
enhancing achievement outcomes for students who are AG be- 
cause they are likely to score above proficiency standards in each 
grade (Council for Exceptional Children, 2010; National Associ- 
ation for Gifted Children, 2014). These advocacy groups have 
argued for an increased focus on achievement growth rather than 
status in accountability models. A similar argument has been made 
for a focus on achievement growth for SWD because a substantial 
number of students in this group may score far enough below the 
cutpoint for grade-level proficiency that improved outcomes may 
go unrecognized when the accountability focus is on the single 
cutpoint for grade level proficiency (Buzick & Laitusis, 2010). 

Given the increasing interest in using achievement growth as a 
key outcome in school accountability models (e.g., Hoffer et al., 
2011), lack of information about reading achievement growth for 
children with exceptionalities is problematic, whether they are 
SWD or students who are AG. The potential of achievement 
growth to offer more fair and valid measures of achievement 
progress depends on a normative understanding of student 
achievement growth including the nature and likely range of 
interindividual differences in observed growth and how these 
interindividual differences change across grades. When the growth 
expectations incorporated into school accountability policies are 
not based on empirical evidence, policy validity is threatened 
(Harris, 2009; Lee, 2004) given that inferences about teacher and 
school performance may be inaccurate. 

Although an accurate picture of reading achievement growth 
across grades and how it differs among groups of children has 
implications for school accountability models, it also can inform 
models of the development of reading. With the wide-scale im- 
plementation by the United States of annual testing in reading and 


mathematics across Grades 3 to 8, datasets are now available 
where the achievement of large numbers of students can be tracked 
across grades, often using vertical scales (Dadey & Briggs, 2012). 
One potential use of these datasets is an examination of how 
different theoretical predictions of individual and group perfor- 
mance (e.g., Baumert, Nagy, & Lehmann, 2012; Stanovich, 1986) 
fit with the observed growth trajectories for children participating 
in annual testing programs. Such datasets also can provide descrip- 
tions of achievement growth for groups of students where longi- 
tudinal studies have been scarce and often conducted on small 
samples (e.g., Scarborough & Parker, 2003). 


Theory and Research on Reading 
Comprehension Growth 


The “simple view of reading” is the premise that reading com- 
prehension is the joint product of word identification and language 
(listening) comprehension (Gough & Tunmer, 1986). Although it 
is not a complete account of the skills underlying reading com- 
prehension (Perfetti, Landi & Oakhill, 2005; Vellutino, Tunmer, 
Jaccard, & Chen, 2007), as a general framework it is useful for 
characterizing reading development and the nature of reading 
difficulties, as well as identifying key areas for instruction and 
remedial interventions (e.g., Compton, Miller, Elleman, & Steacy, 
2014; Garcia & Cain, 2013). Within the simple view of reading, 
word identification and language comprehension are viewed as 
largely independent contributors to reading comprehension, deter- 
mined by underlying skills (e.g., phonological awareness and rapid 
decoding for word identification; semantic and syntactic knowl- 
edge for language comprehension) that have limited overlap (Vel- 
lutino et al., 2007). The relative contribution of the two compo- 
nents varies at different stages of reading acquisition (Tighe & 
Schatschneider, 2014). Language comprehension plays a smaller 
role until readers develop enough facility in word identification to 
be able to fluently decode text at or near their ability to understand 
spoken language—at age 9 or 10 for most students (Garcia & Cain, 
2013; Vellutino et al., 2007). 

The simple view of reading has two important implications for 
the study of reading achievement growth and gaps for SWD. First, 
it follows from the model that there are two major sources of 
reading difficulties for children, deficits in word identification and 
deficits in language comprehension, which can occur separately or 
together (Compton et al., 2014). These difficulties are likely to be 
distributed differently among children in the various exceptionali- 
ties served in special education, depending on the nature and 
severity of the disability. For example, most students with learning 
disabilities (LD) in reading have marked difficulties in word 
recognition, but adequate language comprehension skills (Vellu- 
tino, Fletcher, Snowling, & Scanlon, 2004). In contrast, students 
with intellectual disabilities show impairments in both areas (Wise, 
Sevcik, Romski, & Morris, 2010). 
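
The multiplicative form of the simple view (Gough & Tunmer, 1986) can make these two profiles concrete. The sketch below treats each component as a proportion between 0 and 1. The function name and all numeric values are hypothetical and chosen only for illustration; they are not data from any study.

```python
def reading_comprehension(word_identification, language_comprehension):
    """Simple view of reading: the product of two component skills in [0, 1]."""
    for skill in (word_identification, language_comprehension):
        if not 0.0 <= skill <= 1.0:
            raise ValueError("skills are expressed on a 0-1 scale")
    return word_identification * language_comprehension


# Hypothetical skill profiles: (word identification, language comprehension).
profiles = {
    "typical reader": (0.9, 0.9),
    "LD in reading": (0.3, 0.9),            # weak word recognition only
    "intellectual disability": (0.3, 0.3),  # both components impaired
}

for label, (wi, lc) in profiles.items():
    print(f"{label}: {reading_comprehension(wi, lc):.2f}")
```

Because the components multiply rather than add, a deficit in either one caps comprehension no matter how strong the other is, and deficits in both compound.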

Second, as the relative influence of word identification and 
language comprehension on reading comprehension changes, it is 
likely to affect the rate of reading comprehension growth. When 
children acquire foundational skills in decoding, their sight word 
vocabularies expand rapidly (Ehri, 2005; Ehri & Snowling, 2004), 
removing word identification as an initial bottleneck and allowing 
rapid growth in reading comprehension (Scarborough & Parker, 
2003). However, when word identification skills have reached a 
level commensurate with listening comprehension skills, reading 
comprehension growth is likely to slow as skills such as inference 
making, comprehension monitoring, and lexical knowledge be- 
come important determinants of further reading growth (Oakhill, 
Cain, & Bryant, 2007; Perfetti et al., 2005). With regard to SWD, 
if the disability slows initial acquisition of decoding skills, but 
students eventually acquire them, rapid growth may occur, but at 
a later point than for SWoD. To the extent that the disability also 
compromises language comprehension skills, growth may slow again. 


Reading Growth Across Grades 


Several studies have made use of large national datasets to 
investigate the nature of reading growth across the school years 
(e.g., Bloom, Hill, Black, & Lipsey, 2008; Lee, 2010). The general 
finding is that reading growth is curvilinear with large increases in 
the early grades that decelerate as students progress through 
school. For example, Lee (2010) made use of long-term NAEP 
data, several national datasets, and national norms from standard- 
ized achievement tests to examine the nature of student growth in 
reading and mathematics. He characterized national achievement 
growth trajectories as having “remarkable consistency and stability 
across different tests and cohorts over the long term” (Lee, 2010, 
p. 825). 


Individual Differences in Reading Growth and 
Developmental Patterns 


In individual differences research, investigators seek to under- 
stand the sources of group and individual variation that occur 
within a general developmental pattern (Pennington, 2002). In 
terms of individual differences in reading growth, three general 
developmental patterns have been described (Pfost et al., 2014). 
The first is a pattern where growth in students with higher initial 
literacy levels outpaces growth for students with lower initial 
literacy levels, resulting in widening achievement gaps over time. 
This pattern is often termed a Matthew effect, or a cumulative 
growth or fan-spread pattern (Morgan, Farkas, & Hibel, 2008; 
Stanovich, 1986). The second pattern, termed a compensatory 
growth or fan-close pattern (Francis, Shaywitz, Stuebing, Shay- 
witz, & Fletcher, 1996), is one where students with lower initial 
literacy levels show more growth than students with higher levels 
of initial literacy, resulting in narrowing of the achievement gap 
over time. The final pattern is one of stable differences among 
students where growth for students at different initial achievement 
levels remains parallel over time. 
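
The three patterns can be told apart by asking whether the gap between an initially lower and an initially higher group grows, shrinks, or holds steady. The sketch below assumes simple linear trajectories and compares the gap at two grades. The function names, the `tolerance` parameter, and the trajectory values are all hypothetical.

```python
def gap(low, high, grade):
    """Gap (high minus low) at a grade for linear (intercept, slope) trajectories."""
    low_score = low[0] + low[1] * grade
    high_score = high[0] + high[1] * grade
    return high_score - low_score


def classify_pattern(low, high, first_grade, last_grade, tolerance=0):
    """Label the pattern by how the gap changes between two grades.

    `tolerance` is a hypothetical cutoff below which a change counts as stable.
    """
    change = gap(low, high, last_grade) - gap(low, high, first_grade)
    if change > tolerance:
        return "fan-spread"
    if change < -tolerance:
        return "fan-close"
    return "stable differences"


# Made-up trajectories: (score at grade 0, points gained per grade).
print(classify_pattern((300, 10), (340, 12), 3, 7))  # fan-spread
print(classify_pattern((300, 12), (340, 10), 3, 7))  # fan-close
print(classify_pattern((300, 10), (340, 10), 3, 7))  # stable differences
```

With real data the trajectories would be estimated rather than fixed, and a nonzero `tolerance` would reflect how small a change in the gap is treated as practically negligible.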

Although different mechanisms are thought to produce each of 
these developmental patterns (Pfost et al., 2014), there are also 
multiple mechanisms that might underlie each of the patterns. In 
terms of the fan-spread pattern, Stanovich (1986) proposed that 
reciprocal relationships between reading development and the fac- 
tors enhancing it result in widening individual differences in 
reading achievement over time. Many of the self-reinforcing rela- 
tionships he described were the result of differences in children’s 
emergent literacy skills altering their motivation to read and op- 
portunity to practice reading skills, ultimately affecting the devel- 
opment of new vocabulary and other reading skills, and resulting 
in accelerated or slowing growth (Pfost et al., 2014). However, 


other mediating mechanisms for the fan-spread pattern are also 
possible (Baumert et al., 2012; Morgan et al., 2008; Scarborough 
& Parker, 2003). For example, Baumert et al. (2012) distinguished 
between individual- and status-driven fan-spread effects, with cog- 
nitive or behavioral characteristics producing cumulative effects at 
the individual level, but status-driven effects resulting from ad- 
vantages or disadvantages afforded different groups (e.g., an as- 
sociation between student poverty and lower quality school envi- 
ronments that has a cumulative impact on reading achievement). 

Possible explanatory mechanisms for fan-close effects in read- 
ing include (a) developmental lags that narrow individual differ- 
ences as the initially lower group catches up and initially higher 
achieving students’ growth plateaus (Francis et al., 1996); (b) 
insufficient opportunities to learn for initially higher achieving 
students resulting in slowing growth (Rambo-Hernandez & Mc- 
Coach, 2015); or (c) strong compensatory or remedial education 
services for initially low achieving students (Baumert et al., 2012). 
Finally, for the stable differences pattern, Baumert et al. proposed 
two possible mechanisms. One was that growth at each time period 
was not a cumulative result of all previous reading experiences but 
simply a function of achievement at the immediately previous time 
point. The second possibility was that multiple growth influences 
could be simultaneously operative and moving in opposite direc- 
tions, such as a fan-close pattern influencing individual develop- 
ment and a status-driven pattern producing a fan-spread effect, 
together producing a pattern of stable differences across time. 

In a meta-analysis that examined the research evidence for 
Matthew effects in reading across 25 years, Pfost et al. (2014) 
found no support for the existence of a single developmental 
pattern that characterized reading skill growth. Instead, there was 
evidence that developmental patterns varied by the type of reading 
skill assessed. For example, constrained skills, such as letter 
knowledge, phonics, and concepts of print, were more likely to 
show a fan-close developmental pattern. Decoding speed was more 
likely to show a fan-spread or stable differences pattern, and 
reading comprehension was more often associated with a stable 
differences or fan-close pattern. Pfost et al. (2014) also found that 
methodological features of studies affected the developmental 
pattern observed. Studies using reading measures with lower reliability 
(<.90) or with ceiling or floor effects were more likely to find a 
fan-close pattern, an indication that measurement error and regres- 
sion toward the mean could be contributory factors when this 
pattern is observed. 


Methodological Considerations in Examining 
Achievement Growth Patterns 


Pfost et al. (2014) specifically excluded studies focusing on 
SWD from their meta-analysis; however, their study has several 
implications for studies of SWD growth patterns and the design of 
the present study. First, the finding that developmental patterns 
were related to the reading skill studied highlights the importance 
of examining reading growth with measures tapping individual 
reading skills (i.e., reading vocabulary, decoding, and reading 
comprehension) rather than a composite measure. This is particu- 
larly important given that individual component skills have differ- 
ent relationships to sociodemographic and home literacy variables 
(Hecht, Burgess, Torgesen, Wagner, & Rashotte, 2000) and the 
relative weighting of component skills within composite measures 
changes across grades. Use of a composite reading measure may 
explain why studies of longitudinal achievement gaps for different 
sociodemographic groups sometimes show abrupt shifts in the 
pattern of individual differences across grades (e.g., Chatterji, 
2006; Kieffer, 2012). 

Second, use of a reading measure with high reliability and 
adequate floor and ceiling is important to prevent misattributing 
testing artifacts to developmental patterns in the distribution of 
individual differences over time. Although not specifically men- 
tioned by Pfost et al. (2014), a related complication in understand- 
ing patterns of academic growth is the dependence of interpreta- 
tions on the adequacy of the developmental scale (Briggs & 
Weeks, 2009). There are a number of technically sound ap- 
proaches to creating vertical scales that provide a basis for scale 
linkage—usually a common set of examinees over grades or 
embedded linking items at two or more grade levels (see Kolen & 
Brennan, 2004). Nonetheless, interpretation of patterns of growth 
can easily be influenced by details of the scaling model applied, 
the design and number of linking items, differences in content 
specification over time, or lack of correspondence between the IRT 
scale and the underlying theoretical ability dimension represented 
(Bolt, Deng, & Lee, 2014; Zwick, 1992). 

Third, as discussed by Baumert et al. (2012), different student 
characteristics may be associated with different developmental 
patterns operative at the same time. Therefore, controlling for 
covariates is important when examining developmental patterns of 
reading growth associated with particular student characteristics. A 
number of student characteristics associated with lower reading 
comprehension achievement or growth, such as being male; eco- 
nomically disadvantaged; lacking English proficiency; or being of 
Black, Hispanic, or American Indian race/ethnicity also vary by 
SWD status and by specific exceptionality (Wei et al., 2011). 
Including sociodemographic variables as controls when these char- 
acteristics are not the primary focus of study is important to avoid 
confounding group differences because of the characteristic of 
interest versus group differences that result from differential so- 
ciodemographic composition among groups (Morgan et al., 2011). 


Reading Growth and Longitudinal Achievement 
Gaps for SWD 


Although a number of investigators have examined reading 
comprehension achievement growth trajectories for children as a 
function of demographic characteristics (Huang et al., 2014; Kief- 
fer, 2012), or lower and higher initial reading achievement (e.g., 
Protopapas, Sideridis, Mouzaki, & Simos, 2011), surprisingly few 
investigators have examined reading comprehension growth and 
gaps across grades for SWD. To date, only three published studies 
have (a) examined growth in overall reading or reading compre- 
hension achievement, (b) included SWD and SWoD, and (c) also 
controlled for sociodemographic differences (Francis et al., 1996; 
Judge & Bell, 2010; Morgan et al., 2011). Across these three 
studies, only two exceptionality groups were examined: students 
with LD, included in all three studies; and students with speech- 
language impairments (SLI), included in one study (Judge & Bell, 
2010). Each of the studies used composite reading measures (al- 
though Francis et al. reported their results remained the same when 
decoding and reading were examined separately). Two of the 
studies, Judge and Bell, and Morgan et al., used school-identified 
students with LD, included students with LD in any academic area, 
and examined growth from Grades K to 5. Francis et al. used 
researcher-identified students with LD in reading only, based on an 
IQ/achievement discrepancy at 3rd grade, and examined reading 
growth across Grades 1 to 9. 

Although all three studies found students with LD had lower 
initial reading achievement compared with SWoD, only Judge and 
Bell (2010) found the fan-spread pattern predicted by Stanovich 
(1986). Morgan et al. (2011) and Francis et al. (1996) both found 
stable difference patterns with achievement gaps remaining similar 
across the grade spans studied. In the one study including students 
with SLI, Morgan et al. (2011) found that students with SLI 
showed lower kindergarten reading achievement than the reference 
group and fell further behind across grades. A notable feature of 
the Morgan et al. study was the use of additional control variables 
beyond student demographic characteristics in successive models. 
When a teacher rating of students’ “approach to learning” (i.e., 
attentiveness, task persistence, eagerness to learn, adaptability, and 
organization) was included in the model, the differences in inter- 
cepts between the two disability groups and the reference group 
dropped substantially. Teacher ratings on the approaches to learn- 
ing measure also were positively related to reading achievement 
intercept and linear growth. 

A small number of additional studies also examined longitudinal 
reading achievement gaps for students with LD or SLI, but did not 
control for differences in sociodemographic variables. Findings 
from these studies have been mixed, with two reporting substantial 
reductions in the achievement gap across grades; Scarborough and 
Parker (2003) on both a reading composite and comprehension 
measure for students with LD, and Skibbe et al. (2008) using a 
composite reading measure and tracking students with language 
difficulties identified before school entry. One study reported 
stable achievement gaps for both word recognition and reading 
comprehension for students with language impairments (Catts, 
Bridges, Little, & Tomblin, 2008), and one reported a widening 
achievement gap in reading comprehension for students who were 
LD (McKinney & Feagans, 1984). In summary, studies of longi- 
tudinal achievement gaps in reading for SWD are quite limited, 
have been restricted to only two specific exceptionalities, and do 
not consistently report a fan-spread pattern. 

One additional study by Wei et al. (2011) is relevant to the 
present study because it addressed reading growth for students in 
a much broader range of exceptionality categories; however, it did 
not include a comparison group of SWoD. Using students with LD 
as the reference group, Wei et al. examined level of reading 
achievement and curvilinear growth for students in 10 of the 12 
remaining exceptionality categories recognized in the Individuals 
with Disabilities Education Act (IDEA, 2004). In terms of reading 
comprehension, all exceptionality groups showed curvilinear 
growth in reading comprehension achievement from age 7 to 17. 
Level of reading comprehension achievement (at the average age 
of 12.67) differed for nine of the 10 exceptionalities, with four 
groups scoring significantly lower than students with LD (students 
with intellectual disabilities, multiple disabilities, autism, and hearing 
impairment), and five significantly higher (students with orthopedic 
impairment, emotional disturbance, other health impairment, 
visual impairment, and SLI). Linear change coefficients 
were largely comparable across exceptionality groups, although 
some groups had small but significantly lower linear slope coefficients. 
None of the exceptionalities differed significantly from 
students with LD in terms of quadratic curvature. 


Reading Growth for Students Who Are AG 


As noted earlier, research on achievement growth for the AG 
exceptionality group is quite limited. Warne (2014) used above- 
level achievement testing to examine academic growth in gifted 
students. Compared with annual expected growth based on the 
reading test norms for students who were two grades higher than 
the students in the study, he found that male students who were AG 
made less or similar growth on a reading composite test and 
females who were AG made more growth. In another study, 
Rambo-Hernandez and McCoach (2015) examined reading growth 
from 3rd to 6th grade for students who had scored in the top 2% 
of students nationally on a composite measure of reading achieve- 
ment in kindergarten. Compared with students who had average 
reading achievement in kindergarten, AG students grew more 
slowly during the academic year in Grades 3 to 6, but more rapidly 
than the average-achieving students during the summer. Rambo- 
Hernandez and McCoach also found that gifted students’ reading 
growth in Grades 4 and 5 increased relative to their growth rate in 
3rd grade, a pattern different than the curvilinear growth pattern 
observed for other student populations. This result suggests that 
the pattern of reading growth for gifted students may differ in 
functional form from the pattern observed for most students (e.g., 
Bloom et al., 2008). 


Study Purpose and Research Questions 


In summary, there is increasing interest in incorporating mea- 
sures of student growth into programs that report student achieve- 
ment outcomes for monitoring or accountability purposes. How- 
ever, there are only a limited number of studies examining reading 
growth and gaps, with no published studies that have (a) examined 
a broad range of exceptionality groups identified in federal policy 
(IDEA, 2004), (b) included a comparison group without disabili- 
ties, and (c) controlled for student sociodemographic characteris- 
tics. The component skills that account for growth in reading 
comprehension differ across grades, with growth in decoding or 
word identification skills likely to be a more important factor in 
reading comprehension growth in the early grades, and growth in 
language comprehension and higher level cognitive skills more 
important in later grades. This developmental change makes the 
use of composite measures in studying reading growth problem- 
atic, but may also affect the apparent pattern of observed growth in 
reading comprehension when student groups differ markedly in 
their word identification skills and, as a result, in which component 
skills account for growth during that age span. Lastly, few reading 
growth studies have used states’ large-scale reading assessments, 
the primary outcome measure for student achievement in the 
NCLB (2002) school accountability framework. The purpose of 
this study was to address two fundamental questions concerning 
achievement growth in reading comprehension for SWD: 


1. Controlling for demographic differences between groups, 
what is the developmental progress in reading compre- 
hension for GE students and students in specific excep- 
tionality groups (including AG students) on a statewide 
achievement test used for accountability purposes? 


2. Do SWD and AG students show a fan-spread, fan-close, 
or stable growth pattern in reading comprehension from 
3rd to 7th grade or any changes in reading achievement 
gaps relative to GE students? 


Method 


Sample 


The initial sample for this study was all North Carolina students 
who were in the 3rd grade in the 2002-2003 school year, had 
participated in end of grade achievement testing, and had not been 
retained in 3rd grade from the previous year (N = 101,885; see 
first column of Table 1 labeled “Total sample”). The analytic 
sample was created by excluding students who (a) did not have a 
unique identifier in 2003 (N = 5, < 0.1%); (b) did not have 
complete demographic information in Grade 3 (N = 27, < 0.1%); 
or (c) had never participated in the large scale reading test in 
Grades 3 to 7 (N = 1,772, 1.7%). After all these exclusions had 
been applied, the number of students in some exceptionality cat- 
egories (i.e., multiple disabilities, orthopedic impairment, trau- 
matic brain injury, and visual impairment) was 100 or less and too 
small to ensure stable statistical estimation, and students from 
these categories were excluded (N = 162, < 0.2%). When all 
students meeting one or more of these exclusion criteria had been 
eliminated, the analytic sample consisted of 99,919 students 
(98.1% of the students in the state test data file). Characteristics for 
these students are provided in Table 1 under “Analytic sample.” 
The percent of SWD in the total sample (14.4%) was slightly 
higher than the percent of students served in public education 
nationally (12.9%), and the proportions of SWD served within the 


Table 1 
Student Disability Group for the Total and Analytic Sample at 
Wave 1 

                                   Total sample        Analytic sample
Characteristic                     N            %      N          %      h
Students without disabilities      87,226       85.6   87,028     87.1
  General education                80,182       78.7   79,984     80.0   .033
  Academically gifted, reading      5,695        5.6    5,695      5.7   .005
  Academically gifted, other        1,349        1.3    1,349      1.4   .002
Students with disabilities         14,642       14.4   12,891     12.9   .043
  Autism                              395        0.4      204      0.2   .034
  Deaf-blindness                        -          -        -        -      -
  Emotional disturbance               739        0.7      701      0.7   .003
  Hearing impairment                  181        0.2      159      0.2   .005
  Intellectual disability           2,186        2.1    1,280      1.3   .067
  Learning disability, reading      4,827        4.7    4,668      4.7   .003
  Learning disability, other   [illegible]       1.1    1,140      1.1   .001
  Multiple disabilities               115        0.1        -        -      -
  Orthopedic impairment                83        0.1        -        -      -
  Speech-language impairment        3,044        3.0    3,013      3.0   .002
  Other health impairment           1,832        1.8    1,726      1.7   .005
  Traumatic brain injury       [illegible] [illegible]      -        -      -
  Visual impairment                    67        0.1        -        -      -
  Missing exceptionality               17       <0.1        -        -      -
Total sample size                 101,885              99,919

Note. The cell frequency for students with deaf blindness was <10, so 
this group is not reported. 
different exceptionality categories were generally comparable with 
national figures except that the proportion of students identified 
with intellectual disabilities was somewhat higher and the propor- 
tion of students served with autism somewhat lower (U.S. Depart- 
ment of Education, Institute of Education Sciences, National Cen- 
ter for Education Statistics, 2013). 

To create the longitudinal sample, the students who were present 
in the database in 2002-2003 in Grade 3 were matched to all 
succeeding years of test data through Grade 7 (2006-2007), when 
the state introduced a new edition of the reading test. Of the 99,919 
students in the analytic sample, 80.5% had reading scores in all 5 
years, 6.7% had scores in 4 years, 5.1% had scores in 3 years, 3.6% 
had scores in 2 years, and 4.1% had one reading score during the 
5-year study period. Reasons for missing achievement data in- 
cluded student absence, administration of an alternate assessment 
in that year, or leaving the state school system. 
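
As a rough check on the coverage figures above, the following sketch confirms that the reported percentages sum to 100 and derives the average number of reading scores available per student. The variable names are illustrative, and the per-category counts are approximations recovered from rounded percentages, not figures reported in the study.

```python
# Percent of the analytic sample by number of years with a reading score,
# as reported in the text (keys = number of scores, values = percent).
coverage = {5: 80.5, 4: 6.7, 3: 5.1, 2: 3.6, 1: 4.1}

# The categories should exhaust the analytic sample.
total_pct = sum(coverage.values())

# Average number of reading scores contributed per student.
mean_waves = sum(k * pct for k, pct in coverage.items()) / total_pct

# Approximate student counts (assumption: recovered from rounded percentages,
# so these will not match the true counts exactly).
ANALYTIC_N = 99_919
approx_counts = {k: round(ANALYTIC_N * pct / 100) for k, pct in coverage.items()}

print(f"total = {total_pct:.1f}%, mean scores per student = {mean_waves:.2f}")
```

Because the growth models use data from any available time point, students with fewer than five scores still contribute to estimation.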

We examined the extent to which the analytic sample differed 
from the total sample in terms of representation of the different 
exceptionalities using z tests of the difference between two pro- 
portions. Given the sample size, even quite small differences 
between the total and analytic sample were statistically significant 
(p < .05) and a measure of effect size (ES), Cohen’s h, was used 
as a means to interpret differences. Cohen (1988) suggested that in 
the absence of knowledge of the range of typical ES values found 
in an area of study, h values of .20 be considered small, .50 as 
medium, and .80 and greater as large. Given these guidelines, all 
differences in proportions of SWD were quite small, ranging from 
0.01 to 0.07. The largest was the ES for the proportion of students 
with intellectual disabilities represented in the analytic versus total 
sample (ES = .07). Just over 40% of the students with intellectual 
disabilities never participated in the large-scale reading achieve- 
ment testing, instead taking an alternate assessment. 
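
To make the effect size concrete, the sketch below computes Cohen's h from the arcsine transformation of two proportions, using the autism counts reported in Table 1 (395 of 101,885 in the total sample; 204 of 99,919 in the analytic sample). The function name is illustrative and is not taken from the study's analysis code.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: absolute difference between arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)


# Proportion of students with autism in each sample (counts from Table 1).
p_total = 395 / 101_885
p_analytic = 204 / 99_919

h_autism = cohens_h(p_total, p_analytic)
print(round(h_autism, 3))
```

The result rounds to the h of .034 listed for autism in Table 1, well below the .20 threshold Cohen (1988) suggested for a small effect.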

We also examined how the total sample differed from the 
analytic sample in terms of sociodemographic characteristics, and 
how the SWoD and SWD groups within the analytic sample 
differed from each other (see Table 2). The SWD group had a 
higher proportion of males than the SWoD group (67 vs. 49%, h = 
0.38) and more students participating in the free lunch program, an 
indicator of economic disadvantage (48 vs. 37%, h = .24). These 
ES were not small and reflect the higher likelihood that males and 
children in poverty are placed in special education (U.S. Department 
of Education, Institute of Education Sciences, National Center 
for Education Statistics, 2007). 




Measures 


For all analyses, the outcome measure was the student devel- 
opmental scale score on the standardized, second edition North 
Carolina End of Grade Reading Comprehension Tests (EOG-RC) 
at the grade level in which the students were placed that year. A 
technical manual for the second edition was published by the 
North Carolina Department of Public Instruction (NCDPI, 2004a) 
and provides information on the test construction process, test 
reliability and validity, and the procedures used to construct the 
developmental scale. At each grade level, there were three alter- 
nate forms of the test, each consisting of 50 to 56 multiple-choice 
items intended to measure the four strands in the state English/ 
Language Arts curriculum: (a) cognition, (b) interpretation, (c) 
critical stance, and (d) connections. Average internal consistency 
reliability estimates across forms were above .90 for Grades 3 to 8, 
with SEM of 2 to 3 developmental scale score points for the 
majority of respondents scoring within 2 SDs of the grade level 
mean, and as large as 6 points for respondents at the extremes of 
the score distribution. The EOG-RC developmental scale range for 
the grade span examined in this study was a low score of 216 in 
3rd grade and a high score of 287 in 7th grade. Examination of 
score distributions for our analytic sample indicated adequate test 
score ranges within each grade and no evidence of floor or ceiling 
effects overall or for specific student groups. Validity evidence 
provided in the technical manual (NCDPI, 2004a) included high 
teacher ratings of item alignment with the reading curriculum and 
moderate correlations of the developmental scale scores with 
teacher ratings of students’ expected grades in English/Language 
Arts (median correlation of .58 across the Grades 3 to 8) and 
judgment of student achievement (median correlation of .63 across 
Grades 3 to 8). Drawing from other sources for validity evidence, 
the EOG-RC have been found to correlate highly with other group 
administered reading tests, including the STAR reading assess- 
ments (r range of .74—.80 by grade; Renaissance Learning, 2012) 
and the Measures of Academic Progress (r range of .77—.82 by 
grade; Northwest Evaluation Association, 2014). 








Table 2 
Student Demographic Characteristics by Sample, and by Student Group for the Analytic Sample at Wave 1 

                              Total sample     Analytic sample          SWoD             SWD
Characteristic                N         %      N         %      h       N        %       N        %      h
Female                        49,564    48.6   48,875    48.9   .005    44,644   51.3    4,231    32.8   .377
American Indian                1,515     1.5    1,479     1.5   .001     1,254    1.4      225     1.7   .024
Asian                          2,084     2.0    2,035     2.0   .001     1,919    2.2      116     0.9   .108
Black                         29,905    29.4   29,178    29.2   .003    25,059   28.8    4,119    32.0   .069
Hispanic                       7,140     7.0    6,911     6.9   .004     6,304    7.2      607     4.7   .108
Multi-racial                   2,474     2.4    2,439     2.4   .001     2,140    2.5      299     2.3   .009
White                         58,767    57.7   57,877    57.9   .005    50,352   57.9    7,525    58.4   .010
Limited English proficiency    5,140     5.0    4,920     4.9   .006     4,464    5.1      456     3.5   .079
Title I student                3,741     3.7    3,690     3.7   .001     3,236    3.7      454     3.5   .011
Free lunch                    39,124    38.4   37,968    38.0   .008    31,754   36.5    6,214    48.2   .238
Total sample size            101,885           99,919                   87,028          12,891

Note. SWoD = students without disabilities; SWD = students with disabilities. 
The developmental scale scores were created based on a vertical 
linking study using a common items design (Nicewander et al., 
2013; NCDPI, 2004a). As discussed earlier, creation of vertical 
scales requires a number of procedures and design features to 
support interpretability. Patz (2007) noted that the North Carolina 
Language Arts test content is clearly more amenable to vertical 
scaling than content standards in some other states, with North 
Carolina content standards based on common goals that provide 
“. . . continuity of language study and increasing language skill 
development” (NCDPI, 2004a, p. 11). The North Carolina linking 
design entailed the use of 12 linking item sets administered to each 
adjacent pair of grades for Grades 3 through 8 (Nicewander et al., 
2013). In addition, “triplet forms” were used to examine linkages 
of scores across spans of three grades. Creation of the vertical scale 
employed a number of procedures recommended for vertical link- 
ing (Briggs & Weeks, 2009; Kolen & Brennan, 2004) including a 
separate linking approach, use of item response theory (3PL) to 
place forms on the common scale, and use of a maximum likeli- 
hood ability estimator. The linking process resulted in a scale that 
displays desirable patterns of grade to grade growth that support 
the vertical scale including mean score increases across grades, 
relatively flat grade to grade variability, and even separation of 
grade distributions (Kolen & Brennan, 2004). Finally, we exam- 
ined the ratio of each grade’s SD to the Grade 3 SD to evaluate 
scale variability and found no evidence of scale shrinkage (Dadey 
& Briggs, 2012) with ratios of −0.01, −0.04, 0.08, and 0.05 for 
each adjacent pair of grades from Grade 3 to 7. 


Procedures 


The North Carolina EOG-RC tests are administered to students 
the last 3 weeks of each school year as part of the state’s educa- 
tional accountability program. Most students take the test in their 
general education classrooms in a single session of 130 min 
(including breaks and instructions), with classroom teachers ad- 
ministering and proctoring the examination. Students may be ex- 
empted from testing or take an alternate assessment for a variety of 
reasons including medical issues, limited English proficiency, or 
determination by an IEP team that a student with a disability 
should participate in the alternate rather than the general assess- 
ment (NCDPI, 2004b). Test accommodations are available for 
SWD and students with limited English proficiency, and the percent 
of students in the analytic sample receiving accommodations 
each year ranged from 12.2% to 15.4%. Across the 5 years of the 
study, the most common test accommodations for the EOG-RC 
were extended time (12.14 to 14.4% of the sample by year), testing 
in a separate room (9.3 to 13.2%), and marking in the test booklet 
rather than an answer sheet (4.4 to 9.4%). 

For the first year of this study, 2002-2003, the 3rd grade 
participation rate in the EOG-RC was 98.8% overall, and 84.0% 
for SWD. Within the special education population, the participa- 
tion rate by disability varied from 5.2% for students identified as 
multihandicapped to 99.2% for students with SLI. 

Determination of student exceptionality. Students’ primary 
exceptionality classifications in the 3rd grade were used to define 
the exceptionality groups in this study. North Carolina identified 
students for exceptional children’s services using category names 
that differed slightly from the present IDEA categories for SWD 
groups. We mapped the North Carolina categories in use at the 
time of testing onto the federal categories (e.g., students identified 
as “mentally handicapped” in North Carolina’s testing database 
were described as students with intellectual disabilities). The NC 
identification criteria for LD required an ability-achievement 
discrepancy: students had to show a standard score discrepancy 
of 15 standard score units between individually administered IQ and 
achievement tests or (in rare cases) provide 
classroom documentation that a severe discrepancy existed in the 
absence of an ability-achievement discrepancy on standardized 
tests. We divided the LD group into students who were identified 
in reading (LD-R) versus other academic areas (LD-O). North 
Carolina also explicitly recognizes gifted and talented students in 
reading and mathematics as subgroups in their accountability 
reporting system. We created separate groups of students within 
those identified as academically/intellectually gifted, for children 
identified in reading (AG-R) and other areas (AG-O). All other 
students were classified as GE students. 

Construction of the longitudinal file. The study dataset was 
constructed from the annual test and annual student membership 
electronic files available from the North Carolina Educational 
Research Data Center (NCERDC). These files were available for 
each student who attended a North Carolina school during the 
school year in question, even if the student was absent or exempt 
from testing. In addition to test scores, these records contained 
demographic information including students’ exceptional children 
classification as coded by the classroom teacher. The NCERDC 
added a unique identifier for each student to the annual records to 
allow matching of student records across years. To create the 
longitudinal records, we first merged the test and membership files 
for each year, conducted data quality checks on these files, and 
then merged the annual files by student identification number to 
create the longitudinal dataset. 
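
The two-stage merge described above can be sketched as follows. This is a minimal illustration using plain Python dictionaries; the field names (`student_id`, `score`), the toy records, and both function names are hypothetical and do not reflect the actual NCERDC file layouts or the authors' code.

```python
from collections import defaultdict


def merge_year(test_rows, membership_rows):
    """Join one year's test records onto that year's membership records.

    Membership is the base file, so students who were absent or exempt
    from testing are kept with a missing (None) score.
    """
    scores = {row["student_id"]: row for row in test_rows}
    merged = []
    for member in membership_rows:
        test = scores.get(member["student_id"], {})
        merged.append({**member, "score": test.get("score")})
    return merged


def build_longitudinal(yearly_files):
    """Combine merged annual files into one record per student, keyed by year."""
    students = defaultdict(dict)
    for year, (test_rows, membership_rows) in sorted(yearly_files.items()):
        for row in merge_year(test_rows, membership_rows):
            students[row["student_id"]][year] = row["score"]
    return dict(students)


# Toy data with hypothetical identifiers and scores.
yearly_files = {
    2003: ([{"student_id": "A", "score": 248}, {"student_id": "B", "score": 240}],
           [{"student_id": "A"}, {"student_id": "B"}]),
    2004: ([{"student_id": "A", "score": 253}],
           [{"student_id": "A"}, {"student_id": "B"}]),
}
longitudinal = build_longitudinal(yearly_files)
print(longitudinal)
```

Student B appears in the 2004 membership file without a test record, so the longitudinal record retains that year with a missing score rather than dropping the student.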


Analytic Method 


We used two-level hierarchical linear models (Raudenbush & 
Bryk, 2002) to examine the effects of student exceptionality and 
demographic characteristics on 3rd grade reading comprehension 
and subsequent growth across grades. We used full information 
maximum likelihood estimation, specified model parameters as 
random effects, and used data from any available time point for a 
student. The multilevel analyses were completed using HLM 7.0 
(Raudenbush, Bryk, Cheong, Congdon, & du Toit, 2011). Given 
our interest in developmental change, we centered the intercept at 
the first testing occasion (Grade 3). We did not include school as 
a third level in the growth models because our research questions 
did not pertain to school-level differences and use of a third level 
would have resulted in sample attrition because of transitions from 
elementary to middle school and student mobility across schools. 

In model building, we first applied an unconditional growth 
model that served as a basis for comparison with more complex 
models. In the next model, we added dummy-coded predictors for 
student demographic variables. In the final model, we added 
dummy coded predictors representing student exceptionality status 
in Grade 3. We evaluated differences between models using deviance 
tests and calculation of pseudo-R² statistics. 
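
The two model-comparison quantities can be illustrated with a short sketch. It assumes the common definitions for nested multilevel models: pseudo-R² as the proportional reduction in a variance component relative to a baseline model, and the deviance test statistic as the difference in deviances with degrees of freedom equal to the difference in the number of estimated parameters. All numeric inputs below are hypothetical, chosen only to show the arithmetic.

```python
def pseudo_r2(var_baseline: float, var_conditional: float) -> float:
    """Proportional reduction in a variance component after adding predictors."""
    return (var_baseline - var_conditional) / var_baseline


def deviance_difference(dev_simple: float, dev_complex: float,
                        n_params_simple: int, n_params_complex: int):
    """Chi-square statistic and degrees of freedom for two nested models."""
    chi_square = dev_simple - dev_complex
    df = n_params_complex - n_params_simple
    return chi_square, df


# Hypothetical intercept variances for a baseline and a conditional model.
r2 = pseudo_r2(var_baseline=40.0, var_conditional=30.0)

# Hypothetical deviances and parameter counts for two nested models.
chi_sq, df = deviance_difference(3_000_000.0, 2_972_000.0, 10, 34)

print(f"pseudo-R2 = {r2:.2f}, chi-square({df}) = {chi_sq:.2f}")
```

The statistic would then be referred to a chi-square distribution with `df` degrees of freedom; that step is omitted here to keep the sketch dependency-free.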

In each conditional model, the level-1 model specified student 
EOG-RC scores predicted by a quadratic function of time of 
measurement. The level-2 models were composed of the prediction 
of level-1 model parameters as a function of student demographic 
characteristics in the second model and student exceptionality 
categories and demographic characteristics in the third model. The 
choice of a quadratic model to examine growth was based on 
previous research characterizing achievement growth as curvilin- 
ear (e.g., Lee, 2010) and preliminary analyses that indicated in- 
clusion of the quadratic term accounted for statistically significant 
additional variance over a linear model. The initial level-1 model 
was as follows: 


$$Y_{ti} = \pi_{0i} + \pi_{1i}(\text{time}) + \pi_{2i}(\text{time}^{2}) + r_{ti} \qquad (1)$$


where $Y_{ti}$ was the EOG-RC developmental scale score for student i 
at time t and $\pi_{0i}$ was the initial status or intercept for student i at 
Time 0 (Grade 3), $\pi_{1i}$ was the initial linear change, $\pi_{2i}$ was the 
quadratic curvature representing the acceleration or deceleration in 
each student’s growth trajectory, and $r_{ti}$ was the residual for each 
student. 

At level-2, the individual student intercept and slope estimates 
became the criterion variables predicted by the level-2 student 
characteristics. All predictors were dichotomous and uncentered, 
with the coefficient representing the effect for the group coded 
one. The level-2 equations for the reading initial status and growth 
parameters were as follows: 


Initial Status, $\pi_{0i} = \beta_{00} + \sum_{k} \beta_{0k}(\text{Predictor}_{k}) + u_{0i}$ (2) 

Linear Change, $\pi_{1i} = \beta_{10} + \sum_{k} \beta_{1k}(\text{Predictor}_{k}) + u_{1i}$ (3) 

Curvature, $\pi_{2i} = \beta_{20} + \sum_{k} \beta_{2k}(\text{Predictor}_{k}) + u_{2i}$ (4) 


where $\beta_{00}$ was the reading score intercept at Grade 3 for all 
students, each $\beta_{0k}$ represents the average partial regression 
coefficient relating the predictor of interest to students’ initial status, 
and $u_{0i}$ is the residual between the fitted predicted value for each 
student and the student’s observed reading score. For each rate of 
change parameter (i.e., $\pi_{1i}$, linear growth, and $\pi_{2i}$, curvilinear 
change), each individual’s change parameter, $\pi_{pi}$, was modeled as 
a function of the average reading comprehension rate of change, 
$\beta_{p0}$. Each $\beta_{pk}$ represents the average partial regression coefficient 
relating the predictor of interest to students’ change parameters, 
and $u_{pi}$ was the residual between each student’s fitted growth 
parameter of interest and the average parameter across all students. 
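
For readers who prefer a single expression, substituting Equations 2 through 4 into Equation 1 gives the combined model below. This is a standard algebraic rearrangement rather than an equation reported in the study; the first three bracketed terms are the fixed effects and the final bracket collects the random effects.

```latex
\begin{align*}
Y_{ti} ={} & \Big[\beta_{00} + \sum_{k}\beta_{0k}(\text{Predictor}_{k})\Big]
           + \Big[\beta_{10} + \sum_{k}\beta_{1k}(\text{Predictor}_{k})\Big](\text{time}) \\
         & + \Big[\beta_{20} + \sum_{k}\beta_{2k}(\text{Predictor}_{k})\Big](\text{time}^{2})
           + \Big[u_{0i} + u_{1i}(\text{time}) + u_{2i}(\text{time}^{2}) + r_{ti}\Big]
\end{align*}
```

Because every predictor is an uncentered dummy variable, a student in the reference group on all predictors has an expected trajectory of $\beta_{00} + \beta_{10}(\text{time}) + \beta_{20}(\text{time}^{2})$.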

The final step in our overall analysis strategy was the calculation 
of empirical Bayes (EB) estimated means and achievement gap ES 
at each grade. For ES, we calculated a model-based ES by sub- 
tracting the EB estimated mean for the GE students from the EB 
estimated means for each exceptionality group obtained from our 
final HLM model and dividing the estimated group difference by 
the square root of the sum of the level-1 and level-2 model 
variance components (Spybrook, Raudenbush, Liu, Congdon, & 
Martinez, 2008). To provide a comparison with the more descrip- 
tive, model independent type of ES more often reported in the 
literature for disadvantaged groups, we also calculated ES by 
subtracting the mean for SWoD (the combination of students in 
general education and students identified as AG) from the ob- 
served means for each SWD group and dividing by the observed 
SD of the scores for all students in that grade (Bloom et al., 2008). 
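
The two effect size definitions differ only in their denominators, as the following sketch shows. The function names and all numeric inputs are hypothetical; they illustrate the arithmetic and are not values from the study.

```python
import math


def model_based_es(group_mean: float, reference_mean: float,
                   level1_var: float, level2_var: float) -> float:
    """Group difference scaled by the square root of the summed variance components."""
    return (group_mean - reference_mean) / math.sqrt(level1_var + level2_var)


def descriptive_es(group_mean: float, comparison_mean: float,
                   observed_sd: float) -> float:
    """Group difference scaled by the observed SD for all students in the grade."""
    return (group_mean - comparison_mean) / observed_sd


# Hypothetical EB means and variance components for one grade.
es_model = model_based_es(group_mean=240.0, reference_mean=250.0,
                          level1_var=16.0, level2_var=48.0)

# Hypothetical observed means and SD for the same grade.
es_desc = descriptive_es(group_mean=240.0, comparison_mean=249.0, observed_sd=9.0)

print(f"model-based ES = {es_model:.2f}, descriptive ES = {es_desc:.2f}")
```

Negative values indicate that the exceptionality group scored below the comparison group. Note that the two indices also use different comparison groups: GE students for the model-based ES, and all SWoD for the descriptive ES.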


Results 




Multilevel Growth Models 


Unconditional and longitudinal level-1 models. We first 
applied a fully unconditional random effects model, estimating 
only grand means and variance components. We then estimated a 
two-level linear longitudinal model, followed by a quadratic 
growth model. We allowed each growth trajectory parameter to 
vary randomly across students. We found that the quadratic model 
resulted in a statistically significant improvement in model fit over 
a linear model (p < .001) and a multiparameter variance compo- 
nent test indicated that random effects provided a better fit to the 
data than a fixed effects model, χ²(5) = 9434.34, p < .001. 

Results of the unconditional quadratic model are shown in the 
first columns of Table 3. For all students, the estimated mean 
reading comprehension score in Grade 3 was 247.80. The average 
initial linear change was 5.21 scale score points, which differed 
significantly from 0 (z = 357.86, SE = 0.01, p < .001). The 
curvature in the growth function was −0.44 scale score points, a 
value that also differed significantly from 0 (z = −128.89, SE = 
0.003, p < .001). The model parameter intercorrelations between 
intercept and linear, intercept and curvilinear, and linear and 
curvilinear parameters were −.54, .50, and −.88, respectively. 
Multilevel model parameter reliabilities were .86 for the intercept, 
.16 for the linear slope, and .10 for quadratic change. 
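
The fixed effects of the unconditional quadratic model imply an average trajectory that can be traced directly. The sketch below plugs the reported estimates (intercept 247.80, linear change 5.21, curvature −0.44) into the level-1 equation, with time coded as years since Grade 3. The function and variable names are illustrative.

```python
def expected_score(t: int, intercept: float = 247.80,
                   linear: float = 5.21, curvature: float = -0.44) -> float:
    """Model-implied mean score t years after Grade 3 (t = 0 at Grade 3)."""
    return intercept + linear * t + curvature * t ** 2


grades = list(range(3, 8))

# Model-implied mean at each grade, Grades 3 through 7.
means = [round(expected_score(g - 3), 2) for g in grades]

# Year-to-year gains implied by the quadratic function.
gains = [round(later - earlier, 2) for earlier, later in zip(means, means[1:])]

for grade, mean in zip(grades, means):
    print(f"Grade {grade}: {mean:.2f}")
print("Annual gains:", gains)
```

The implied annual gains shrink by 0.88 points each year (twice the curvature coefficient), which is the deceleration the negative quadratic term describes.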

Model with sociodemographic characteristics. The next 
model added dummy coded predictors reflecting students’ so- 
ciodemographic characteristics, with students who were male, 
White, not receiving free lunch, and not classified as having 
limited proficiency in English comprising the reference group. 
With the introduction of the demographic variables, the intercept 
increased to 250.49, linear slope decreased slightly to 5.12, and 
curvature remained the same at −.44. As expected based on 
previous research (e.g., Morgan et al., 2011), students’ growth 
trajectories differed on the basis of sociodemographic characteris- 
tics, with the greatest differences occurring in terms of initial 
reading achievement. Students who had limited English profi- 
ciency, or were of Black, Hispanic, American Indian, or Multira- 
cial ethnicity had significantly lower initial intercepts than the 
reference group. Females and students of Asian ethnicity had 
significantly higher initial reading comprehension scores. In terms 
of linear growth and curvature, most sociodemographic character- 
istics were associated with significantly higher linear slope coef- 
ficients and significantly greater deceleration in growth. Compared 
with the unconditional growth model, the addition of sociodemo- 
graphic predictors resulted in a statistically significant reduction in 
unexplained variance, χ²(24) = 27,638.93, p < .001. The demographic-only 
model accounted for 22.39% of the variance in the 
intercept, 3.18% of the variance in linear slope, and 4.13% of the 
variance in curvature. Intercorrelations among model parameters 
were as follows: −.55 between intercept and linear, .48 between 
intercept and curvilinear, and −.88 between linear and curvilinear. 

Model with sociodemographics and exceptionalities. In the 
final model, we added predictors for student exceptionality to the 
previous model. In this model, the values for the reference group 
(the first row of the right-most columns of Table 3) represent the 
average intercept, slope, and quadratic curvature parameters for 
students who were in general education, male, White, not classified 
as limited English proficient, and not eligible for free lunch. 

Adding exceptionality groups to the model resulted in little 
change in intercept (250.49 to 250.82), and slight decreases in 
linear slope (5.12 to 5.03) and curvature (−0.44 to −0.42) for the 
reference group. Each exceptionality group differed significantly 
from the reference group for the intercept, with students who were 
AG-R and AG-O scoring almost 7 or 8 scale score points above the 
reference group and the different SWD exceptionality groups 
scoring from 2.64 scale score points below the reference group for 
students with SLI to almost 15 scale points below the reference 
group for students with intellectual disabilities. In terms of linear 
slope and curvature, only two exceptionality groups (i.e., students
with autism and hearing impairments) did not differ significantly
from the reference group. Students who were AG-R and AG-O
had linear slope coefficients that were negative but both groups 
also showed curvature that was accelerated over that seen in the 
reference group. In all cases where a SWD subgroup had a linear 
slope that differed significantly from the reference group, the 
coefficient was positive but there was still significant deceleration 
for the group. 
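
To illustrate how the reference-group coefficients reported above combine into a growth trajectory, the following minimal Python sketch evaluates a quadratic growth curve. It assumes that time is coded as years since Grade 3 (Grade 3 = 0), which is consistent with the intercept being read as initial status but is an assumption here; the function and constant names are hypothetical.

```python
# Reference-group coefficients from the final model, as reported in the text.
INTERCEPT = 250.82   # expected scale score at Grade 3
LINEAR = 5.03        # linear slope (scale score points per grade)
QUADRATIC = -0.42    # curvature (negative means decelerating growth)


def predicted_score(grade, intercept=INTERCEPT, linear=LINEAR, quadratic=QUADRATIC):
    """Quadratic growth curve; assumes time = grade - 3 (an assumption)."""
    t = grade - 3
    return intercept + linear * t + quadratic * t ** 2


if __name__ == "__main__":
    for grade in range(3, 8):
        print(f"Grade {grade}: {predicted_score(grade):.2f}")
```

These values describe the model's reference group (general education, male, White, not limited English proficient, not eligible for free lunch), so they are not expected to match the group means in Table 4.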

Compared with the model that included only student sociodemographic
characteristics, the addition of the exceptionality predictors
resulted in a statistically significant reduction in unexplained
variance, χ²(30) = 23,598.93, p < .001, with the explained
variance in intercept increasing by about 17 percentage points to 39.44%,
linear slope by about 5 points to 8.45%, and the variance accounted for in
curvature doubling to 8.26%. Intercorrelations among model parameters in
the final model were as follows: −.52 between intercept and
linear, .46 between intercept and curvilinear, and −.88 between
linear and curvilinear. 
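
The explained-variance figures above can be read as proportional reductions in each variance component relative to the unconditional model. The sketch below shows that calculation and checks the reported gains between the two conditional models. That the percentages were obtained this way is an assumption, and the variance components in the example are hypothetical placeholders, since none are given in this section.

```python
def variance_explained(var_unconditional, var_conditional):
    """Proportional reduction in a variance component (a pseudo R-squared)."""
    if var_unconditional <= 0:
        raise ValueError("unconditional variance must be positive")
    return (var_unconditional - var_conditional) / var_unconditional


# Hypothetical intercept variances, chosen only so the result lands on the
# 39.44% reported for the final model; they are NOT values from the study.
example = variance_explained(100.0, 60.56)

# Percentages reported in the text: (demographics-only model, final model).
reported = {
    "intercept": (22.39, 39.44),
    "linear": (3.18, 8.45),
    "curvature": (4.13, 8.26),
}
# Gain in explained variance, in percentage points.
gains = {name: round(final - demo, 2) for name, (demo, final) in reported.items()}
```

Under these figures the gain is about 17 points for the intercept and about 5 points for linear slope, and the curvature value doubles.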

The empirical Bayes (EB) estimated means and SDs from our 
final growth model by grade, and by exceptionality and grade are 
provided in Table 4 and depicted graphically in Figure 1. Although 
initial achievement by group varied substantially, in general the 
groups showed a similar curvilinear growth pattern across grades 
with more rapid growth up to Grade 5, followed by slower growth 
afterward. The exceptionality groups tended to maintain their 
relative position across grades, with students who were AG-R and 
AG-O consistently outperforming the GE reference group and 
showing less deceleration across grades. Of the SWD groups,
students with SLI performed most similarly to GE students, followed
by students with LD-O. Students with autism, other health
impairment, emotional disturbance, and hearing impairment were
similar in terms of initial reading achievement and growth across 
the grades. Students who were LD-R showed a somewhat different 
growth trajectory, characterized by low initial achievement in 
Grade 3 but more rapid growth than several SWD groups, surpass- 
ing students with emotional disturbance and hearing impairments 
by 7th grade. 

Because linear and quadratic growth parameters separately do 
not denote the actual rate of change at a particular time point, we 
also calculated average rates of change for each group at each 
grade by combining slope and curvature coefficients (see Rauden- 
bush & Bryk, 2002, p. 171). At 3rd grade, the reference group had 
an initial growth rate of 5.03 scale score points. Students with 
LD-R had the highest initial growth rate (6.75 scale score points), 
followed by students with intellectual disabilities (6.41 scale score 
points) and students with emotional disturbance (6.02). Students 
who were AG-R and AG-O had the lowest initial grade growth 
rates (4.44 and 4.49, respectively). By Grade 7, the average growth 
rate for the reference group was 1.66 showing the deceleration of 
growth over grades. Students who were AG-R and AG-O now had 
the second and third highest growth rates (1.82 and 1.76), with 
students with emotional disturbance having the lowest average 
growth rate at Grade 7 (0.69 scale score points). The growth rate 
for students who were LD-R had slowed to 1.23 scale score points 
by 7th grade. 
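
The rate-of-change calculation described above can be sketched as follows. For a quadratic curve, the instantaneous rate at time t is the linear coefficient plus twice the curvature times t. The coding of time as grade minus 3 is an assumption, and the function name is hypothetical.

```python
def growth_rate(linear, quadratic, grade, baseline_grade=3):
    """Instantaneous rate of change of a quadratic growth curve.

    For score = b0 + b1*t + b2*t**2, the rate at time t is b1 + 2*b2*t.
    Time is assumed to be coded as grade - baseline_grade.
    """
    t = grade - baseline_grade
    return linear + 2 * quadratic * t


# Reference-group coefficients from the final model (5.03 and -0.42).
ref_grade3 = growth_rate(5.03, -0.42, grade=3)
ref_grade7 = growth_rate(5.03, -0.42, grade=7)
```

With the rounded coefficients, the Grade 3 rate equals the linear slope (5.03) and the Grade 7 rate comes out at about 1.67. The small difference from the reported 1.66 is presumably due to rounding of the published coefficients.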

Supplemental analysis. As noted above, students with LD-R 
had the highest initial growth rate of all SWD groups. To examine 
whether this growth rate differed significantly, we did a follow-up 
analysis substituting the LD-R subgroup for the GE students as the 
reference group, keeping all other reference group variables the 
same as those in the original model. All differences in intercept, 
linear slope, and quadratic curvature between the LD-R and the 
other SWD groups were statistically significant (p < .05) with a 
few exceptions: compared with LD-R students, students with hear- 
ing impairments did not differ on intercept or quadratic curvature, 
students with intellectual disabilities did not differ in slope or 
quadratic curvature, and students with emotional disturbance did 
not differ in curvature. 


Table 4
Empirical Bayes Estimated Means and SDs (in Parentheses) From the Final Hierarchical Linear Model (HLM) Regression Model

                                                            Grade
Student group                 3              4              5              6              7
All students                  247.77 (8.22)  252.56 (7.61)  256.46 (7.31)  259.48 (7.32)  261.62 (7.61)
General education             248.14 (7.16)  252.85 (6.65)  256.71 (6.42)  259.73 (6.45)  261.90 (6.72)
Academically gifted, reading  258.23 (4.28)  262.33 (3.95)  265.83 (3.84)  268.73 (3.91)  271.03 (4.12)
Academically gifted, other    256.77 (4.80)  260.94 (4.49)  264.46 (4.41)  267.34 (4.51)  269.56 (4.76)
Autism                        241.37 (9.23)  246.41 (8.58)  250.55 (8.28)  253.80 (8.29)  256.17 (8.59)
Emotional disturbance         238.97 (7.65)  244.48 (7.11)  248.55 (6.89)  251.20 (6.95)  252.41 (7.26)
Hearing impairment            238.97 (8.00)  244.34 (7.52)  248.54 (7.33)  251.57 (7.44)  253.43 (7.81)
Intellectual disability       230.81 (4.81)  236.71 (4.48)  241.19 (4.37)  244.25 (4.43)  245.88 (4.65)
Learning disability, reading  237.86 (7.82)  244.05 (7.27)  248.81 (7.04)  252.15 (7.08)  254.07 (7.38)
Learning disability, other    242.95 (7.85)  248.34 (7.31)  252.49 (7.07)  255.42 (7.12)  257.11 (7.41)
Other health impairment       240.33 (7.69)  245.51 (7.15)  249.55 (6.93)  252.46 (6.98)  254.24 (7.28)
Speech-language impairment    245.85 (7.94)  250.85 (7.40)  254.89 (7.16)  257.95 (7.20)  260.04 (7.49)

Figure 1. Empirical Bayes (EB) estimated means from final model by grade and student group. [Line graph; y-axis: mean EB reading comprehension score; one line per student group.]


Growth Patterns and Achievement Gaps 


Our second research question concerned whether the pattern of 
individual differences related to students’ disability status (con- 
trolling for sociodemographic differences among groups) showed 
a fan-spread, fan-close, or stable growth pattern, and whether 
reading comprehension gaps closed for any SWD groups. Bast and 
Reitsma (1998) proposed three facets of results from longitudinal 
growth studies that are relevant for characterizing developmental 
patterns in individual differences: (a) stability of individual differ- 
ences over time, (b) stability of interindividual variance in the 
population, and (c) direction of the correlation between baseline 
level and growth (Bast & Reitsma, 1998). 

In terms of Bast and Reitsma’s (1998) first criterion, the differ- 
ences among exceptionality groups were quite stable across 
grades. The ordering of means for the exceptionality groups 
changed little from Grade 3 to 7, with the possible exception of the 
LD-R group, who in 3rd grade was ranked second from bottom and 
who by 7th grade had surpassed students with emotional distur- 
bance and hearing impairments. In terms of Bast and Reitsma’s 
second criterion, the SDs across grades for all students as a group
showed a small fan-close pattern, with the SD decreasing from
8.22 in 3rd grade to 7.32 in 6th grade and 7.61 in 7th grade (see
Table 4). With regard to the third criterion, the correlation between
the student EB estimated intercepts and reading gains across
Grades 3 to 7 (gain calculated by subtracting the estimated intercept
at Grade 3 from the estimated score at Grade 7 for each
student) was r = −.40 (p < .001). These results
are consistent with a fan-close effect. 
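
The second and third criteria can be sketched in a few lines of Python. The SDs are taken from Table 4. The student-level scores are hypothetical and serve only to show how the intercept-gain correlation is formed, because the EB estimates for individual students are not reported here; the function names are likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def intercept_gain_correlation(grade3_scores, grade7_scores):
    """Correlate initial status with total gain (Grade 7 minus Grade 3)."""
    gains = [late - early for early, late in zip(grade3_scores, grade7_scores)]
    return pearson_r(grade3_scores, gains)


# Criterion 2: SDs for all students by grade, taken from Table 4.
ALL_STUDENT_SDS = {3: 8.22, 4: 7.61, 5: 7.31, 6: 7.32, 7: 7.61}
sd_change = ALL_STUDENT_SDS[7] - ALL_STUDENT_SDS[3]  # negative means narrowing

# Criterion 3: hypothetical estimates for five students (NOT study data).
grade3 = [230.0, 240.0, 248.0, 255.0, 262.0]
grade7 = [247.0, 255.0, 262.0, 267.0, 272.0]
r = intercept_gain_correlation(grade3, grade7)  # negative here by construction
```

A negative SD change together with a negative intercept-gain correlation is the combination the text describes as a fan-close pattern; a positive sign on both would point toward fan-spread.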

To further examine how growth differed over time for students 
in the SWD, AG, and GE groups, we calculated achievement gaps 
at each grade for students in each exceptionality group using EB 
estimated means, and a second time using observed means, as 
detailed earlier in the analysis section. As indicated in the top half
of Table 5 and Figure 2, AG-R students scored about 1 SD above
the reference group of GE students at each grade, with students 
who were AG-O consistently scoring above the reference group, 
although lower than the AG-R group. The achievement gaps for 
students in the different SWD exceptionality groups varied; stu- 
dents with SLI consistently had the smallest gaps, under a quarter 
of a SD, and students with intellectual disabilities consistently had 
the largest gaps, well over 1.5 SDs at each grade level. 

In most cases, when achievement gaps for SWD are calculated
(e.g., U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, 2014), the comparison
group is SWoD, with AG and GE students combined to
form the comparison group, and means are not adjusted for differences
in sociodemographic characteristics among groups.
Therefore, we also calculated ESs across grades combining the GE
and AG groups (see bottom half of Table 5). Achievement gaps for
SWD were larger when the reference group was all SWoD, but the
pattern of differences among exceptionality groups remained the
same. With either comparison group, no SWD group closed the
achievement gap appreciably by 7th grade. Students with LD-R, 
the highest prevalence SWD exceptionality group, showed the 
largest relative achievement gain across the grades, with the ES 
narrowing from −1.09 in 3rd grade to −0.83 in 7th grade in
comparison to GE students using EB estimated means, and
from −1.16 to −0.92 in comparison to SWoD. However, for all
SWD groups, the extent to which the achievement gap closed was 
small relative to the size of the initial achievement gap. 
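
As a rough check on the model-based effect sizes, the sketch below recomputes the LD-R gap from the EB means in Table 4. The denominator is the square root of the summed level-1 and level-2 variance components, which is not reported in this section. The value of 9.43 used here is an assumption, back-calculated so that the Grade 3 result reproduces the reported −1.09.

```python
def gap_effect_size(group_mean, reference_mean, pooled_sd):
    """Standardized achievement gap: (group mean - reference mean) / SD."""
    if pooled_sd <= 0:
        raise ValueError("pooled_sd must be positive")
    return (group_mean - reference_mean) / pooled_sd


# EB estimated means from Table 4.
GE = {3: 248.14, 7: 261.90}     # general education
LD_R = {3: 237.86, 7: 254.07}   # learning disability, reading

# ASSUMPTION: not a value reported in the text; back-calculated from the
# reported Grade 3 effect size and the Table 4 means.
ASSUMED_SD = 9.43

es_grade3 = gap_effect_size(LD_R[3], GE[3], ASSUMED_SD)
es_grade7 = gap_effect_size(LD_R[7], GE[7], ASSUMED_SD)
```

With that assumed denominator, the Grade 7 value rounds to −0.83, which agrees with the figure reported in the text.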


Discussion 


Evaluating congruence between the results from descriptive
studies of students’ reading achievement growth and theory-based
predictions is one means of advancing reading theory and understanding
its implications for practice. In addition, research on
achievement growth is limited for groups of students, such as
SWD and AG students, whose learning needs may differ from
those of the general population. The primary purpose of the present
study was to add to the empirical knowledge about reading comprehension
achievement growth for SWD. Innovative design features
of the study included examination of growth for SWD by
specific exceptionality categories; examination of growth for AG
students; inclusion of a GE reference group; use of an operational
state test of reading comprehension; and the calculation of ES as
an empirical means of examining changes in the SWD achievement
gap across grades.


Table 5
Model-Based and Observed Reading Comprehension Achievement Gap Effect Sizes by Exceptionality

                                                         Grade
Student group                   3            4            5            6            7
Using empirical Bayes (EB) estimated means
  Academically gifted, reading  +1.07        +1.01        +.97         +.95         +.97
  Academically gifted, other    +.92         +.86         +.82         +.81         +.81
  Autism                        [illegible]  −.68         [illegible]  −.63         [illegible]
  Emotional disturbance         −.97         −.89         [illegible]  [illegible]  [illegible]
  Hearing impairment            [illegible]  −.90         [illegible]  −.86         −.90
  Intellectual disability       −1.84        [illegible]  [illegible]  −1.64        −1.70
  Learning disability, reading  −1.09        −.93         −.84         −.80         −.83
  Learning disability, other    [illegible]  −.48         −.45         −.46         [illegible]
  Other health impairment       [illegible]  [illegible]  [illegible]  [illegible]  −.81
  Speech-language impairment    −.24         −.21         [illegible]  −.19         −.20
Using obtained means from analytical sample
  Autism                        −.66         [illegible]  [illegible]  [illegible]  −.54
  Emotional disturbance         −1.02        [illegible]  −1.04        −1.07        −1.09
  Hearing impairment            [illegible]  [illegible]  −1.06        −.93         −1.00
  Intellectual disability       −1.90        −1.89        −1.88        −1.89        [illegible]
  Learning disability, reading  −1.16        [illegible]  −.98         [illegible]  −.92
  Learning disability, other    −.64         −.61         [illegible]  [illegible]  [illegible]
  Other health impairment       −.90         −.93         −.89         −.90         −.85
  Speech-language impairment    −.34         [illegible]  [illegible]  [illegible]  −.30

Note. Effect sizes in the top half of the table were calculated by subtracting the EB estimated mean for the students in
general education from the EB mean for each exceptionality group and dividing the estimated group difference
by the square root of the sum of the level-1 and level-2 model variance components. Effect sizes in the bottom
half of the table were calculated by subtracting the observed mean for all students without disabilities from each
exceptionality group observed mean and dividing by the observed SD of the scores for all students in that grade.


Figure 2. Reading achievement gap effect sizes based on differences in empirical Bayes estimated means
across grades for students in different exceptionality categories compared with students in general education.
[Line graph; x-axis: grade (3 to 7); one line per exceptionality group.]


Initial Differences and Reading Comprehension 
Achievement Growth Across Grades 


With regard to our first research question concerning the devel- 
opmental progress in reading comprehension for students in spe- 
cific exceptionality groups, we found that the pattern of growth 
over grades observed in previous studies of the general student 
population (e.g., Lee, 2010; Rescorla & Rosenthal, 2004) also 
characterized growth in the 10 exceptionality groups examined. 
Developmental progress in reading comprehension across Grades 
3 to 7 was best represented as a curvilinear function. Students 
made larger gains in the early grades that decelerated as students 
transitioned across grades. This finding is consistent with the 
simple view of reading or a convergent skills model of reading 
comprehension development (Gough & Tunmer, 1986; Vellutino 
et al., 2007). As readers’ word recognition skills become more 
fully developed, allowing them to decode text near or at their 
listening comprehension level, skills in language comprehension 
become primary contributors to reading comprehension growth 
(Garcia & Cain, 2013; Tighe & Schatschneider, 2014; Vellutino et 
al., 2007), resulting in an overall slowing of reading comprehen- 
sion growth. If we assume that students who are identified as 
AG-R in 3rd grade already had well developed word recognition 
skills, this interpretation could also account for their lower initial 
growth rates in 3rd grade and higher growth rates relative to most 
other exceptionality groups in 7th grade. 

Although the overall shape of reading comprehension achieve- 
ment growth across grades resembled that found in studies of the 
general student population, each SWD exceptionality group had 
significantly lower initial achievement in 3rd grade compared with 
the reference group. Most SWD exceptionality groups also showed 
small, but statistically significant departures from the comparison 
group of GE students in both linear growth and quadratic curva- 
ture, with higher linear growth, but more deceleration across 
grades. Students identified as AG-R and AG-O had higher initial 
reading comprehension achievement, but lower linear growth and 
less deceleration across grades. 

The lower initial reading achievement and curvilinear growth 
observed for SWD in the present study has been found in studies 
of reading growth for students with LD and SLI (e.g., Francis et 
al., 1996; Judge & Bell, 2010; Morgan et al., 2011). Similar to the 
one study that examined multiple exceptionalities other than LD 
and SLI (Wei et al., 2011), we observed considerable heterogene- 
ity in intercept by exceptionality. On a test where the average GE 
student annual growth was approximately five scale score points in 
the 3rd grade, differences in intercept between GE students and 
SWD in the present study ranged from less than three scale score 
points for students with SLI to almost 18 points for students with 
intellectual disabilities. Although Wei et al. examined differences 
in SWD intercepts when children were about 3.5 years older than 
students in the present study (approximately 9.3 vs. 12.7 years of 
age for Wei et al.), the rank ordering of exceptionality groups was 
similar. Students with intellectual disabilities had the lowest reading
achievement level and students with SLI had the highest level
of the SWD groups. Students with autism, other health impairment,
LD, or emotional disturbance showed initial achievement
levels similar to one another that fell between the least and most 
impaired SWD groups. 

Morgan et al. (2011) found that students with SLI ranked lower 
in reading achievement at the end of 1st grade than students with
LD, a finding at odds with the present study and Wei et al. (2011). 
Longitudinal studies of changes in exceptionality classification 
across grades suggest that many students initially identified with 
SLI in preschool are later identified as having LD (Delgado, 2009). 
It may be that students who are identified as SLI in preschool or 
kindergarten who later show significant impairment in reading are 
then reclassified as LD, resulting in a change in relative reading 
achievement for the two exceptionality groups. We were unable to 
obtain special education classifications for students before 3rd 
grade to examine whether some of the students in the LD-R group had
previously been identified as SLI, but did confirm that the small 
number of students who were SLI in 3rd grade that were later 
classified as LD-R had reading comprehension trajectories that 
more closely tracked the LD-R group. 

The finding in the present study that students with autism ranked 
higher than students with LD-R is a reversal of the rankings 
reported by Wei et al. (2011). The difference in relative standing 
between the exceptionalities is likely because of differences in the 
subset of students with autism included in each study. In the 
present study, over half the students with autism had consistently 
participated in an alternate reading assessment rather than the 
general assessment, and were not included in the study. Although 
Wei et al. excluded some students with cognitive impairments 
from their sample, their reading assessment was designed for a 
wider range of ability and age (three to adulthood), and it is likely 
their study included students with more severe cognitive impair- 
ments than the students with autism in the present study. 


Developmental Pattern for Reading Comprehension: 
Stable, Widening, or Narrowing? 


Our second research question concerned the developmental 
growth pattern observed for reading comprehension skills and its 
impact on reading comprehension achievement gaps for SWD 
across grades. Taken in their totality, our findings were most 
consistent with a stable differences developmental pattern. For 
SWD, we found no evidence of a fan-spread pattern or Matthew 
effect relative to the criteria proposed by previous researchers 
(Bast & Reitsma, 1998; Pfost et al., 2014). Although some of our 
findings could be viewed as supportive of a fan-close pattern in the 
general population (i.e., the small decrease in total variance across 
time and moderate negative correlation between intercept and total 
gain), achievement gaps for SWD changed very little across 
grades, and group rankings for the exceptionality groups also 
remained stable. After 4 years, none of the SWD groups had 
“caught up” with the students in GE in their reading comprehen- 
sion achievement. One possible exception to the stable differences 
pattern was the .26 SD unit decrease from Grade 3 to 7 in the 
achievement gap ES for students with LD-R. Taken as a percent of 
the initial achievement gap (ES = −1.09), this represented a 24%
decrease in the achievement gap for students with LD-R. 
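
The 24% figure follows directly from the two effect sizes reported for the LD-R group. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
def percent_gap_closed(initial_es, final_es):
    """Share of the initial gap (in absolute SD units) that closed, in percent."""
    if initial_es == 0:
        raise ValueError("initial effect size must be nonzero")
    return (abs(initial_es) - abs(final_es)) / abs(initial_es) * 100


# LD-R effect sizes relative to GE students at Grades 3 and 7 (from the text).
closed = percent_gap_closed(-1.09, -0.83)
```

The result is roughly 23.9%, which rounds to the 24% stated above.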

It is difficult to evaluate the consistency of the present findings 
with previous research given the lack of longitudinal studies that 
have examined exceptionality groups other than LD and SLI, and
the differences in methodology and age span studied within the 
existing corpus of studies. For SLI students, our results are con- 
sistent with Catts et al.’s (2008) characterization of differences for 
this group as stable across grades, but in conflict with the results 
of Morgan et al. (2011), who found an increasing achievement gap 
in reading for this group. Our results concerning changes in the 
achievement gap across grades for students who are LD-R were 
more positive than the widening achievement gap for students with 
LD reported by Judge and Bell (2010), and the stable gap reported 
by Morgan et al. (2011). However, neither of those studies sepa- 
rated out students with LD in reading versus other academic areas, 
and our results would more closely track Morgan et al.’s if our two 
LD groups had been combined. Our finding that students with 
LD-R differed from students with LD-O in terms of intercept and 
growth suggests that studies combining these two groups (e.g., 
Judge & Bell, 2010; Wei et al., 2011) may not yield results that are 
representative of either group, and may cloud the possible differ- 
ential response of the two groups to school-based reading inter- 
ventions. 

As noted earlier, several researchers have suggested that differ- 
ences in findings across studies relative to developmental patterns 
of growth in reading may be attributable to the (a) grade at which 
initial status is determined, (b) the grade span examined, or (c) the 
reading skill assessed (Kieffer, 2012; Morgan et al., 2011; Pfost et 
al., 2014). For example, Judge and Bell (2010) and Morgan et al. 
(2011) used the same dataset (the Early Childhood Longitudinal 
Study-Kindergarten Cohort, ECLS-K), grade span, and composite 
reading measure, but placed their intercepts at kindergarten entry 
(Judge & Bell, 2010) and the end of 1st grade (Morgan et al., 
2011). They found marked differences in the correlation between 
intercept and slope (.35 vs. −.07). This suggests that the choice of
intercept location is critical to the overall growth pattern observed, 
and our placement of the initial intercept at Grade 3 may explain 
our moderate and negative correlation between initial intercept and 
total gain. On the other hand, it is unlikely that reading compre- 
hension was weighted heavily, if at all, at Grades K and 1 in the 
ECLS-K composite assessment, and the present study used a 
measure of reading comprehension. The differences in types of 
reading skills assessed may have contributed to discrepant findings 
across studies. It should also be noted that patterns of growth and 
the size of achievement differences may be affected by artifactual 
differences in the vertical scale from one assessment to another 
that can impact functional form or the equality of score intervals 
over time (Bolt et al., 2014; Briggs & Weeks, 2009). 
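
The dependence of the intercept-slope correlation on where the intercept is placed can be shown with a short sketch. It uses a linear growth model for simplicity, whereas the present study fit a quadratic model, and the variance components are hypothetical rather than taken from any of the studies discussed.

```python
import math


def intercept_slope_correlation(var_intercept, var_slope, cov, shift):
    """Correlation between intercept and slope after moving the time origin.

    In a linear growth model y = a + b*t, re-centering time at t0 = shift
    gives a new intercept a' = a + b*shift, so that
        var(a')    = var(a) + 2*shift*cov(a, b) + shift**2 * var(b)
        cov(a', b) = cov(a, b) + shift * var(b)
    """
    new_var = var_intercept + 2 * shift * cov + shift ** 2 * var_slope
    new_cov = cov + shift * var_slope
    return new_cov / math.sqrt(new_var * var_slope)


# Hypothetical variance components (NOT values from any study cited here).
VAR_A, VAR_B, COV_AB = 64.0, 1.0, -2.0

r_origin = intercept_slope_correlation(VAR_A, VAR_B, COV_AB, shift=0)
r_two_later = intercept_slope_correlation(VAR_A, VAR_B, COV_AB, shift=2)
r_four_later = intercept_slope_correlation(VAR_A, VAR_B, COV_AB, shift=4)
```

With these illustrative values the correlation moves from −.25 at the original origin, through zero two time units later, to +.25 four units later, even though the underlying trajectories are unchanged.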


Potential Explanations for the Stable 
Differences Pattern 


The correlational nature of the present study, and lack of infor- 
mation about home and school literacy influences or cognitive and 
behavioral characteristics related to reading, except as they are 
represented by the different exceptionality classifications, limit 
any conclusions about the mechanisms underlying the observed 
stable differences growth pattern. One hypothesis that fits the 
study’s overall pattern of results is that SWD in the present study 
were receiving instruction in general and special education that 
emphasized the development of word recognition skills—a likely 
culprit when most children experience early reading difficulties
(Rayner, Foorman, Perfetti, Pesetsky, & Seidenberg, 2001). If so,
then that instruction may have produced sufficient gains in early 
word recognition skills to allow the rapid growth in reading 
comprehension seen as readers initially develop a corpus of sight 
words (Ehri, 2005), but it did not result in growth in the skills that 
underlie age-appropriate reading comprehension as language com- 
prehension skills versus word identification skills become ascen- 
dant as determinants of reading comprehension for typically de- 
veloping readers (Scarborough, 2001; Tighe & Schatschneider, 
2014). As such, SWD showed an increased rate of growth as the 
initial bottleneck in their reading comprehension skill development 
was removed, but the instruction did not address needs that were 
“hidden” by the students’ initial decoding deficits, such as deficits 
in background knowledge, making inferences, constructing mean- 
ing from text, or failure to develop sophisticated context- 
dependent word identification skills for word recognition (Comp- 
ton et al., 2014; Oakhill & Cain, 2012; Perfetti & Stafura, 2014). 

Supporting this view is the finding that the group of students 
with LD-R, the SWD group most likely to have a specific and 
marked deficit in word recognition, had an initial growth rate that 
differed significantly from all but one of the SWD groups, and was 
the only group where the achievement gap showed some narrow- 
ing. Other evidence supporting this view (although much less 
direct) is that the time period when students were receiving special 
education services in the present study corresponded with a state- 
wide professional development initiative in North Carolina aimed 
at improving the teaching of reading foundations for SWD 
(NCDPI, 2007). This interpretation is also consistent with findings 
from a recent review of observational studies of special education 
instruction (McKenna, Shin, & Ciullo, 2015) that indicated that 
phonics instruction accounted for a substantial portion of time in 
special education settings with limited time spent teaching reading 
comprehension. 

Baumert et al. (2012) postulated that a stable differences pattern 
for reading growth could result from competing mechanisms in 
reading development that produce simultaneous fan-close and fan- 
spread patterns. An explanation for the present study’s results 
consistent with that hypothesis is that schools differentially allo- 
cate resources to assure that low-achieving children reach grade- 
level proficiency and these actions produce a fan-close pattern for 
schooling that is countered by many mechanisms within reading 
development that produce a fan-spread pattern (Stanovich, 1986). 
This interpretation is supported by research showing the cumulative
impact of summer and after-school experiences on achievement
growth (Alexander, Entwisle, & Olson, 2007; Pfost, Dörfler,
& Artelt, 2013) and the pattern of greater gains in summer for 
students who are AG compared with reading gains during the 
school year (Rambo-Hernandez & McCoach, 2015). 


Study Limitations 


Although this study has contributed new evidence on the read- 
ing achievement growth of SWD, these findings should be inter- 
preted in light of several limitations. Specifically, the present study 
only included SWD who participated in the general reading as- 
sessment at some time during Grades 3 to 7. SWD with severe 
cognitive impairments, most prevalent in the exceptionalities of 
intellectual disability and autism, did not take the general assess- 
ment and were not represented in the analytic sample. Therefore, 
our results should be considered representative of reading growth
only for students participating in the general education assessment. 
A second important issue relative to the SWD group is that we 
based exceptionality category membership on students’ 3rd grade 
primary special education classification. Some students in special 
education are served in two or more exceptionality categories and 
we did not have information on how many students had multiple 
exceptionality classifications or comorbid conditions, such as at- 
tention-deficit/hyperactivity disorder, that are prevalent across cat- 
egories (Blackorby et al., 2005). Reading achievement trajectories 
for students with comorbid disabilities may differ from the out- 
comes reported by primary disability only. Students enter and exit 
special education throughout their educational careers (Ysseldyke 
& Bielinski, 2002), change exceptionality classifications (Black- 
orby et al., 2005), and move in and out of the alternate and general 
assessment, and these changes were not represented in the present 
study. Depictions of achievement growth for students who were 
consistently in special education across the grade span of the study 
or who entered special education after 3rd grade may be different 
(Schulte & Stevens, 2015). 

We had no information about students’ educational and home 
environments, or their motivation to read, independent reading, or 
attention and engagement in school, although each of these factors 
is related to reading comprehension growth (Guthrie et al., 2007; 
Morgan et al., 2011; Wei et al., 2011). We also lack external 
validation of the exceptionalities and how closely school identifi- 
cation of exceptionality matches with descriptions of different 
disabilities in the research literature. Another limitation of the 
present study is the use of large scale reading assessment data from 
only one state. Eligibility criteria, prevalence, and characteristics 
of children receiving special education differ by state (U.S. De- 
partment of Education, Institute of Education Sciences, National 
Center for Education Statistics, 2015), as do the content and format 
of state reading assessments (May, Perez-Johnson, Haimson, Sat- 
tar, & Gleason, 2009). The NC state reading assessment focused 
exclusively on comprehension of connected text and did not in- 
clude items specifically designed to assess reading vocabulary, 
decoding, or other reading skills which may show different pat- 
terns of individual differences in reading growth (Pfost et al., 
2014). 

Finally, although we had the advantage of using a database that 
permitted tracking of students who moved from one public school 
or school district to another within the state, we lost data for students who
moved out of state or entered private schools. Our losses (about 
4% per year) were less than observed in most longitudinal studies 
(e.g., Choi, Seltzer, Herman, & Yamashiro, 2007), but attrition still 
may have had a measurable effect on estimation of growth or ES. 
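
To give a sense of scale for the attrition described above, the sketch below compounds an annual loss rate over the four yearly transitions between Grades 3 and 7. It assumes a constant 4% loss each year, which is a simplification of the "about 4% per year" reported here; the function name is hypothetical.

```python
def cumulative_retention(annual_loss, years):
    """Share of the initial cohort remaining after compounding annual losses."""
    if not 0 <= annual_loss <= 1:
        raise ValueError("annual_loss must be a proportion between 0 and 1")
    return (1 - annual_loss) ** years


# ASSUMPTION: a constant 4% loss at each of the four yearly transitions.
retained = cumulative_retention(0.04, 4)
```

Under that assumption, roughly 85% of the initial cohort would remain by Grade 7.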


Implications and Conclusions 


The present study adds to the growing body of research about 
reading growth in SWD and the limited research on achievement 
growth for students who are AG. Cross-sectional depictions of the
SWD achievement gap have led to concerns that SWD are falling 
further behind SWoD at each grade (Morgan et al., 2008; Vaughn 
& Wanzek, 2014). Our results suggest that stable differences or a 
slight fan-close pattern more accurately describes the pattern of 
individual differences between SWD and AG students relative to 
GE students across 3rd to 7th grade. Although this finding is more
positive than some current portrayals of the SWD achievement
gap, it should in no way minimize the implications of the substan- 
tial gaps present at 3rd grade and beyond for SWD. 

Why did the more rapid growth for SWD fail to result in a 
narrowing of the achievement gap, with the exception of a small 
amount of closure in the gap for students who were LD-R? We 
speculated that the more rapid growth observed for SWD could be 
attributed to an increased focus on instruction in word recognition 
in special education as a result of national and state policy initia- 
tives. Removing the initial “bottleneck” in word recognition for 
children who were, on average, acquiring these skills somewhat 
later than SWoD, had a lasting impact only for the group where 
this deficit was most likely to be primary. If this interpretation is 
correct, it suggests that early assessment and intervention directed 
toward oral language comprehension and vocabulary for students 
with disabilities who show deficits in these areas may be a means 
of sustaining the more rapid initial growth in reading comprehen- 
sion for SWD observed in the present study (e.g., Clarke, Snowl- 
ing, Truelove, & Hulme, 2010). 

In terms of policy implications, our findings reinforce the con- 
cerns expressed by others (e.g., Wei et al., 2011) that one-size- 
fits-all achievement expectations ignore the magnitude and com- 
plexity of the differences in reading comprehension achievement 
for SWD already present in the 3rd grade. Closing some of the 
larger SWD achievement gaps, even over the course of several 
grades, would require reading growth rates with much greater 
acceleration than those observed in the present study and higher 
than those found in studies with intensive, multiyear interventions 
for students with reading difficulties (e.g., Allor, Mathes, Roberts, 
Cheatham, & Otaiba, 2014; Roberts et al., 2013). Our findings do 
suggest that accountability models that set similar expectations for 
achievement growth for SWD and SWoD may more closely match 
data than growth to proficiency models that require much more 
accelerated growth for SWD. 

NCLB’s (2001) goal of all students reaching grade-level profi- 
ciency within a uniform timeframe was set in the context of a 
limited body of research about students’ developmental patterns of 
achievement growth. This lack of information was particularly 
acute for SWD who had yet to be fully included in many national 
and state assessments, but also extended to other groups such as 
students who are gifted. The implementation of NCLB, with its 
annual assessments of students, has resulted in rich data sources 
for understanding student achievement growth, but these data 
spotlight the challenges in achieving its ultimate goal. Although 
our results do not indicate increasing achievement gaps, the stable 
differences or slight fan-close pattern observed indicate that, on an 
assessment designed to monitor NCLB progress, the goal of clos- 
ing the achievement gap for SWD has yet to be met. 

As mathematics standards press for algebra instruction in elementary
school (National Council of Teachers of Mathematics,
2006; National Governors Association Center for Best Practices & 
Council of Chief State School Officers, 2010), and as children are 
expected to engage in reasoning about mathematics as early as 
preschool (National Council of Teachers of Mathematics, 2014), it 
is necessary to understand the prealgebraic reasoning of children in 
the elementary grades. In the present study, we evaluated the 
equation-solving performance of first- and second-grade children 
to learn how children apply knowledge of arithmetic (i.e., number 
and operations) to prealgebra. This investigation informs research- 
ers and educators about the current ability of young children to 
engage in prealgebraic reasoning and provides a framework for 
supporting the transition from arithmetic to prealgebra. 


Arithmetic and Algebra 


Several mathematicians have described arithmetic and algebra 
as separate entities. For example, Herscovics and Linchevski 
(1994) described a cognitive gap between arithmetic and algebra.
According to Herscovics and Linchevski, this gap exists because 
children cannot spontaneously work with an unknown using only 
operational knowledge. Filloy and Rojano (1989) used cut to 
define the divide between arithmetic and algebra. Similar to Her- 
scovics and Linchevski, Filloy and Rojano expressed operating on 
an unknown as necessary for demonstration of algebraic thinking, 
and children have difficulty developing this skill with only an 
initial understanding of operations and arithmetic (Filloy & Ro- 
jano, 1989). 

Rather than describing arithmetic and algebra as separated by a 
cut or gap, Pillay, Wilss, and Boulton-Lewis (1998) described the 
divide as a sequence. Pillay et al. confirmed that a gap exists 
between arithmetic and algebra, and because of this gap, it is 
necessary to develop the prealgebraic knowledge of children. 
Pillay et al.’s three-stage model illustrates arithmetic leading to 
algebra with prealgebra as the second stage linking arithmetic and 
algebra. With arithmetic, children focus on numbers and numerical 
procedures. Prealgebra involves an understanding of the equal sign 
and solving equations with one unknown (Carraher & Schliemann, 
2007; Pillay et al., 1998). In the present study, we define prealge- 
bra according to Pillay et al.’s definition. Finally, algebra com- 
prises solving for more than one unknown. As arithmetic is nec- 
essary for competence with algebra (Boulton-Lewis, Cooper, 
Atweh, Pillay, & Wilss, 2000), and as children without strong 
arithmetic skill will likely have difficulty with algebra, prealgebra 
acts as the intermediary agent between arithmetic and algebra to 
adequately prepare children for the rigors of algebra. 

In a related way, Carraher, Schliemann, Brizuela, and Earnest 
(2006) explained arithmetic and algebra as not distinct from one 
another. In fact, algebra utilizes arithmetic and acts as an extension 
of arithmetic (Britt & Irwin, 2008). Kilpatrick, Swafford, and 
Findell (2001) described a transition from arithmetic to algebra in
which children learn another approach (i.e., algebra) to solving 
equations. Kieran (2004) outlined the transition from arithmetic to 
algebra as involving a focus on the relations in the equation, 
inverse operations and how operations can be undone, the un- 
known in the equation, and a refocus on the interpretation of the 
equal sign. Kieran’s description of the arithmetic to algebra tran- 
sition closely aligns with the three-stage model of Pillay et al. 
(1998). 

We use Pillay et al.’s (1998) arithmetic to prealgebra to algebra 
three-stage model to frame the present study to learn how preal- 
gebra acts as the stage at which children develop (spontaneously or 
with instruction) an understanding of operating with or on an 
unknown. Given that competence with algebra is necessary for 
success in high school and beyond (Spielhagen, 2006), and be- 
cause of the push for algebraic reasoning in elementary and middle 
school (e.g., National Council of Teachers of Mathematics, 2006), 
it is necessary to understand how young children navigate the 
arithmetic to prealgebra transition and how prealgebraic under- 
standing develops in children. The link between arithmetic and 
algebra has been established with older children (e.g., Fuchs et al., 
2012), and we extend this work to children as young as first grade. 
Kaput (1998) explained that the key to improving algebra skill in 
the later grades is the incorporation of algebra across the contin- 
uum of grade levels, and Smith and Thompson (2007) character- 
ized current elementary curricula as inadequate for the develop- 
ment of prealgebra and algebra understanding. Importantly, Byrd, 
McNeil, Chesney, and Matthews (2015) demonstrated that arith- 
metic interpretations of equations lead to lower performance on 
prealgebraic tasks than algebraic interpretations, such as those 
outlined by Kieran (2004). We conducted the present study to learn 
how children use arithmetic skill for prealgebraic reasoning and 
which characteristics of prealgebra are difficult for young children. 
This research may provide educators with a framework for early 
prealgebra instruction. 


Prealgebraic Reasoning 


At the elementary level, researchers and educators ask children 
to solve mathematical equations with one unknown
to quantify prealgebraic reasoning (e.g., McNeil, Fyfe, Petersen, 
Dunwiddie, & Brletic-Shipley, 2011; Stephens et al., 2013; 
Weaver, 1973). By solving an equation, a child demonstrates a 
conceptual understanding of balance between two sides of an 
equation. This, in turn, demonstrates fluidity with several algebraic 
principles, including a manipulation of symbols, a study of rela- 
tions, and modeling (Jacobs, Franke, Carpenter, Levi, & Battey, 
2007; Kaput, 1998; Kieran, 2004). The expectation for prealge- 
braic reasoning emerges in first grade when standards outline that 
children should be able to solve addition and subtraction problems 
with the “unknowns in all positions” (National Governors Asso- 
ciation Center for Best Practices & Council of Chief State School 
Officers, 2010, p. 15). Children who demonstrate advanced 
equation-solving performance reveal stronger prealgebraic reason- 
ing with functions and overall mathematics competence (Carraher 
et al., 2006; Powell & Fuchs, 2014). 

Over the last several decades, researchers have demonstrated 
that many children experience difficulty solving addition and 
subtraction prealgebraic equations (de Corte & Verschaffel, 1981;
Molina & Ambrose, 2008). Attention has turned toward the symbols
used in equations as one of the possible reasons for differential
performance on prealgebraic tasks such as solving equations
(e.g., Raghubar et al., 2009). Researchers have learned that a 
majority of children have difficulty with the relational symbol 
called the equal sign (““=”; Capraro, Capraro, Ding, & Li, 2007; 
Falkner, Levi, & Carpenter, 1999; Rittle-Johnson & Alibali, 1999), 
and equal sign interpretation has an impact on algebraic thinking 
(Kieran, 1992). Many children interpret the equal sign as “do 
something” or “write the answer” instead of interpreting the equal 
sign as a balance between two sides of an equation (Asquith, 
Stephens, Knuth, & Alibali, 2007; Sherman & Bisanz, 2009). This 
is especially true for children in the United States because of 
language associated with the equal sign (i.e., “equals” is an am- 
biguous term) and a lack of proper instruction on the symbol 
(Capraro et al., 2010; National Mathematics Advisory Panel, 
2008), and this misunderstanding begins in the elementary grades 
and persists through middle school and high school (Alibali, 
Knuth, Hattikudur, McNeil, & Stephens, 2007). 


Equations 


Prealgebraic reasoning with mathematical equations may be 
presented in several forms. Table 1 provides descriptions and 
examples of different types of equations. Children may solve 
standard equations, for which the equal sign is in a standard 
position and an operator symbol is on the left side of the equal 
sign. Standard equations are the most common equation type used 
in elementary curricula and are featured in mathematics textbooks 
more than 90% of the time (Powell, 2012). Overexposure to 
standard equations promotes a “mindlessness” in which children 
solve problems without actively engaging in prealgebraic thought 
about aspects of the problem or strategies necessary to solve the 
problem (McNeil, 2008, p. 1534). That is, children rely solely on 
arithmetic skill instead of prealgebraic reasoning to solve standard 
equations. 

Children may solve nonstandard equations in which the equal 
sign is in an atypical position in the equation. Nonstandard equa- 
tions can be identity statements, indicating a number equals itself. 
Another type of nonstandard equation features the operator symbol 
on the right side of the equal sign. Most often, operation-right-side 
equations feature a place in the equation for three values, so these 
equations are visually the most similar to standard equations, 
which also have a place in the equation for three values. Nonstan- 
dard equations can also have operations on both sides of the equal 
sign (McNeil & Alibali, 2004). These equations often have a place 
for four or more values. Operation-both-sides equations may be 
used to demonstrate properties of mathematics, such as the com- 
mutative property (e.g., Jacobs et al., 2007). Operation-both-sides 
equations have been widely used in the fields of education and 
psychology to determine how quickly children can change mis- 
conceptions (e.g., the equal sign as an operational symbol; Alibali, 
1999; Matthews & Rittle-Johnson, 2009; Perry, 1991) and to 
compare the equal-sign performance of children in different coun- 
tries (e.g., Li, Ding, Capraro, & Capraro, 2008). 
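
The equation types described above can be told apart mechanically by checking where the operator symbols sit relative to the equal sign. The following is a minimal sketch of that idea, not the authors' coding procedure; the function name `equation_type`, the `__` notation for the blank, and the label strings are assumptions made for illustration.

```python
def equation_type(equation: str) -> str:
    """Label an open equation by where its operator symbols sit.

    Assumes exactly one equal sign and only '+' or '-' as operators.
    """
    left, right = equation.split("=")

    def has_operator(side: str) -> bool:
        return "+" in side or "-" in side

    if has_operator(left) and has_operator(right):
        return "operation-both-sides"
    if has_operator(right):
        return "operation-right-side"
    if has_operator(left):
        return "standard"
    return "identity"


for eq in ["8 - 3 = __", "__ = 4", "7 = 4 + __", "5 + 4 = __ + 2"]:
    print(eq, "->", equation_type(eq))
```

Under this rule, only the last three examples count as nonstandard, because the equal sign is not followed by a lone answer blank.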

de Corte and Verschaffel (1981) described three dimensions that
contribute to difficulty with solving equations for young children.
First, the arithmetic operation (i.e., addition or subtraction) may
impact performance. Second, whether the equation is standard or
nonstandard may lead to performance differences. Third, the position
of the unknown may influence response. In previous work,
Weaver (1973) demonstrated performance differences on standard
and operation-right-side nonstandard equations by assessing first-,
second-, and third-grade children. Results from Weaver’s work
corroborate de Corte and Verschaffel’s three dimensions of prealgebraic
difficulty. First, children demonstrated better performance
on addition equations than subtraction equations. Second, children
exhibited greater success solving standard equations than nonstandard
equations. Third, the most difficult equations for children
were those with the unknown in an unconventional place in an
equation (e.g., __ - b = c or c = __ - b).

Table 1
Types of Equations

More recently, researchers have assessed the performance of 
elementary and middle schoolchildren on nonstandard operation- 
both-sides equations. McNeil and Alibali (2004) determined 
fourth-grade children make errors such as adding all and adding to
the equal sign for an operation-both-sides problem such as
3 + 4 + 5 = __ + 5. Children experienced even more frustration with
operation-both-sides nonstandard equations with the unknown as
the final part of the equation (e.g., 3 + 4 + 5 = 3 + __). Whereas
McNeil and Alibali only presented children with addition equa- 
tions, Molina and Ambrose (2006) presented third-grade children 
with both addition and subtraction nonstandard operation-right- 
side and operation-both-sides equations. Regardless of operation, 
children rarely interpreted the equal sign as relational. Interest- 
ingly, McNeil (2007) assessed children Ages 7, 9, and 11 on 
nonstandard operation-both-sides equations. Children at Ages 7 
and 11 performed better, but not well, on solving nonstandard 
equations than the 9-year-old children. This indicates variation in 
performance during the elementary grades. 
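
To make these error patterns concrete, the sketch below contrasts a relational solution with the two operational errors named above for 3 + 4 + 5 = __ + 5. It is an illustrative reading of those strategies (adding every number shown, or adding only the numbers before the equal sign), not code from the cited studies; the function names are hypothetical.

```python
def relational_answer(left_terms, right_known):
    """Value of the blank that balances both sides of an addition equation."""
    return sum(left_terms) - sum(right_known)


def add_all_answer(left_terms, right_known):
    """'Adding all' error: sum every number printed in the problem."""
    return sum(left_terms) + sum(right_known)


def add_to_equal_sign_answer(left_terms, right_known):
    """'Adding to the equal sign' error: sum only the numbers before '='."""
    return sum(left_terms)


left, right = [3, 4, 5], [5]  # 3 + 4 + 5 = __ + 5
print(relational_answer(left, right))         # 7
print(add_all_answer(left, right))            # 17
print(add_to_equal_sign_answer(left, right))  # 12
```

Only the first answer treats the equal sign as a statement of balance; the other two treat it as a signal to compute.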


Connection with Arithmetic 


Across most of the work in the elementary grades related to 
prealgebraic equations, addition and subtraction operations have 
been the primary focus (e.g., Capraro et al., 2007; McNeil & 
Alibali, 2004; Molina & Ambrose, 2008; Powell, Driver, & Julian, 
2015). In terms of arithmetic, children typically develop an under- 
standing of part-part-whole relationships (i.e., addition) before 
understanding taking away from a whole (i.e., subtraction; Canobi,
2005). Part-part-whole involves amounts (i.e., parts) added together
for a total (i.e., whole), whereas taking away involves
taking the subtrahend amount away from the minuend. In fact, and 
in line with de Corte and Verschaffel’s (1981) first dimension, 
children often demonstrate better skill with addition than subtrac- 
tion (Baroody, 1999), and many children use addition knowledge 
to solve subtraction problems (e.g., counting on to solve a subtraction
problem) because individual addition skill is more efficient
(Peters, De Smedt, Torbeyns, Verschaffel, & Ghesquière,
2014). This differential performance based on arithmetic skill 
indicates that elementary children may perform better on standard 
and nonstandard equations for which addition skill can be used 
rather than subtraction skill. Other factors, such as the complexity 
of arithmetic (i.e., unknown in various positions in the equation; 
greater counting skill required) and the complexity of equation 
(i.e., standard vs. nonstandard), may influence prealgebraic rea- 
soning. In this study, we investigate arithmetic and the properties 
of equations that influence prealgebraic reasoning. 


Purpose of the Present Study and Research Questions 


As equations are used to assess the prealgebraic reasoning of 
elementary children, it is important to understand which, if any, 
performance differences exist based on arithmetic, properties of 
equations, and other variables. This information may inform pre- 
algebraic instruction and assessment in the elementary grades. This 
information may also add support to Pillay et al.’s (1998) model 
connecting arithmetic to algebra, with prealgebra as the link be- 
tween the two. In the present study, we examined whether there 
were differences in performance on prealgebra tasks (i.e., standard 
and nonstandard equations) with addition and subtraction operator 
symbols by grade and season of administration. We asked the 
following research questions: 


How does prealgebra performance differ across three cohorts: 
first-grade children assessed in the spring, second-grade chil- 
dren assessed in the fall, and second-grade children assessed 
in the spring? We explored this question to understand the 
development of prealgebraic understanding across first and 
second grade. We hypothesized that child performance would
differ as a function of cohort. 


Does children’s arithmetic fluency predict prealgebra perfor- 
mance? We asked this question to understand the connection 
between arithmetic and prealgebra, as outlined by Pillay et al. 
(1998). We examined two arithmetic fluency skills important 
in early mathematics and appropriate for first and second 
grade: addition fluency and subtraction fluency (National 
Council of Teachers of Mathematics, 2006). Based on prior 
research (Peters et al., 2014), we hypothesized that children 
with strong arithmetic fluency would solve more equations 
successfully because limited arithmetic understanding can 
hinder algebraic reasoning (Banerjee, 2011). 


We proposed the following questions as secondary to the anal- 
ysis. These research questions were exploratory in nature and 
included to understand child- and item-level differences in the 
arithmetic to prealgebra sequence: 


What prealgebra item characteristics (a change of three or 
more number spaces required, nonstandard equation type, 
operation-both-sides, opposite operation required, and sub- 
traction operation shown) relate to children’s prealgebraic 
performance? 


Do hypothesized interactions relate to children’s prealgebraic 
performance? We tested the following specific interactions: 
(a) Addition Fluency × Nonstandard Type, (b) Addition
Fluency × Subtraction Operation, and (c) Subtraction Operation
× Change of Three or More Number Line Spaces.


What, if any, performance differences exist within the three 
cohorts based on demographic variables? We considered 
whether race/ethnicity, gender, special education status, re- 
tention status, and English learner (EL) status explained dif- 
ferences in prealgebraic reasoning. 


Method 


Participants 


First- and second-grade participants (N = 1,796) were sampled
from 112 classrooms in 19 schools in two school districts in the 
mid-Atlantic region of the United States. We assessed three co- 
horts of children: (a) first-grade children assessed in the spring of 
2013 (i.e., Grade 1 Spring; n = 805 children from 52 classrooms 
in 18 schools), (b) second-grade children assessed in the fall of 
2012 (i.e., Grade 2 Fall; n = 489 children from 31 classrooms in
11 schools), and (c) second-grade children assessed in the spring
of 2012 (i.e., Grade 2 Spring; n = 502 children from 30 classrooms 
in 10 schools). The study is cross-sectional; therefore, no child was 
tested twice. The mathematics instruction in both school districts 
was guided by state standards (www.doe.virginia.gov/testing/sol/ 
standards_docs/), and teachers in the school districts used either 
Math Expressions by Houghton Mifflin or Math Connects by 
Macmillan/McGraw-Hill to guide instruction. 

We gathered demographic information (i.e., gender, race, special
education status, EL status, and retained status) for all participants.
Demographic characteristics of the children in the sample
are presented in Table 2. The cohorts differed in terms of race/
ethnicity, special education status, retention status, and EL status.
The Grade 2 Fall cohort had more Caucasian children, children 
with special education status, and children who had been retained, 
than the other two cohorts. The Grade 1 Spring cohort had a larger 
proportion of EL children than the other two cohorts. The three 
cohorts also differed, predictably, in their arithmetic fluency, with 
higher scores for each successive cohort. In the analysis, Grade 2 
Fall was used as a reference category, making it easy to observe 
differences based on the assessment period. 
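
Cohort differences on a categorical variable of this kind are typically tested with a chi-square test of independence. The sketch below computes the statistic by hand for the gender-by-cohort counts reported in Table 2. It assumes a Pearson chi-square without continuity correction and is offered as an illustration of the test, not as the authors' analysis code; the function name is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Rows: female, male. Columns: Grade 1 Spring, Grade 2 Fall, Grade 2 Spring.
gender_by_cohort = [[394, 231, 253], [411, 258, 249]]
stat, df = chi_square(gender_by_cohort)
print(f"chi-square({df}) = {stat:.2f}")
```

With these counts the statistic is small relative to its 2 degrees of freedom, which is consistent with gender not being among the variables on which the cohorts differed.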


Child Measures 


We assessed the mathematics performance of all children on 
three measures: Addition Fluency, Subtraction Fluency, and Open 
Equations. For Addition Fluency (Fuchs, Hamlett, & Powell, 
2003), children had 1 min to answer 25 vertically presented addi- 
tion facts with single-digit addends and sums to 12. The examiner 
read the directions aloud and then allowed children to work inde- 
pendently. The maximum score was 25. The coefficient alpha 
(Cronbach’s alpha) for this sample was .95. For Subtraction Flu- 
ency (Fuchs et al., 2003), children had 1 min to answer 25 
vertically presented subtraction facts with single-digit subtrahends 
and minuends to 12. After reading the directions aloud, the exam- 
iner allowed children to work independently. The maximum score 
was 25, and the coefficient alpha (Cronbach’s alpha) for the 
sample was .91. Both addition fluency and subtraction fluency 
demonstrate reliability across the elementary grades and act as 
strong predictors of overall mathematics achievement (e.g., Pow- 
ell, Fuchs, et al., 2015). 

For Open Equations (Powell, 2007), children had 8 min to solve 
30 horizontally presented open equations (see Figure 1 for the 
Open Equations measure). Each equation was presented with one 
blank (e.g., 2 = 7 - __), and children wrote a number on the blank. 
All equations used single-digit numbers, and no sum or minuend 
was greater than nine. Open Equations comprised 10 standard 
equations (i.e., operation-left-side) and 20 nonstandard equations. 
Of the nonstandard equations, children solved two identity state- 
ments (e.g., ___ = 4), 10 operation-right-side equations, and eight 
operation-both-sides equations. Excluding the identity statements, 
14 of the equations involved addition and 14 involved subtraction. 
The score was the number of equations solved correctly, with a 
maximum of 30. With only two identity statements included on 
Open Equations, we excluded these two items from the main 
statistical analysis. The overall internal consistency (Cronbach’s 
alpha) for this sample was .93, regardless of whether we included 
the identity statements. 
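
For reference, coefficient alpha can be computed from a children-by-items matrix of 0/1 item scores as sketched below. The small score matrix is made up for illustration and is not study data; the function name is hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a matrix with rows = children, columns = items."""
    k = len(scores[0])
    item_variances = [pvariance(column) for column in zip(*scores)]
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up responses from five children on four items (1 = correct).
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(toy_scores), 2))
```

Alpha rises toward 1 as items covary more strongly, which is why a long measure of closely related items can reach values above .90.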


Item Measures 


Open Equations included a number of different item character- 
istics, the influence of which we tested in this study. Table 3 shows 
each item by feature, and Table 4 shows the number of items of 
each type for each variable and the mean accuracy for each type. 
We describe each variable in subsequent paragraphs. All were 
coded dichotomously, such that the “1” value represented what we 
theorized to be more difficult for children. Thus, in our statistical 
model, we anticipated negative effects for all of these variables. 


Table 2

Child Descriptive Statistics (N = 1,796)

                                Grade 1 Spring  Grade 2 Fall    Grade 2 Spring  Total           Group comparison
Variable                        n (%)           n (%)           n (%)           n (%)           df        χ²
Gender                                                                                          2         .99
  Female                        394 (49)        231 (47)        253 (50)        878 (49)
  Male                          411 (51)        258 (53)        249 (50)        918 (51)
Race/Ethnicity                                                                                  8         52.[illegible]
  African American              161 (20)        47 (10)         103 (21)        311 (17)
  Asian American                26 (3)          23 (5)          18 (4)          67 (4)
  Latino                        103 (13)        51 (10)         28 (6)          182 (10)
  Other                         16 (2)          7 (1)           18 (4)          41 (2)
  Caucasian                     499 (62)        361 (74)        335 (67)        1,195 (67)
Child is in special education                                                                   2         6.00*
  No                            755 (94)        471 (96)        466 (93)        1,692 (94)
  Yes                           50 (6)          18 (4)          36 (7)          104 (6)
Child has been retained                                                                         2         11.66**
  No                            762 (95)        479 (98)        489 (97)        1,730 (96)
  Yes                           43 (5)          10 (2)          13 (3)          66 (4)
Child is an English learner                                                                     2         16.58***
  No                            696 (86)        452 (92)        463 (92)        1,611 (90)
  Yes                           109 (14)        37 (8)          39 (8)          185 (10)

                                M (SD)          M (SD)          M (SD)          M (SD)          df        ANOVA F
Addition Fluency score          7.59 (5.40)     12.74 (6.34)    15.99 (6.57)    11.34 (7.00)    2, 1793   320.16***
Subtraction Fluency score       3.84 (3.10)     6.84 (4.61)     9.00 (6.09)     6.10 (5.02)     2, 1793   [illegible]
Open Equations score            4.75 (4.56)     8.19 (6.08)     13.63 (7.98)    8.17 (7.12)     2, 1793   328.05***

Note. df = degrees of freedom; ANOVA = analysis of variance.
*p < .05. **p < .01. ***p < .001.


Change of 3 or more number line spaces (Change3More).
This variable represented the number of spaces on a number
line a child would move to reach the correct answer. As arithmetic
with a change of one or two number line spaces (e.g., 6 + 1,
10 - 2) is typically easier than with a change of three or more
(e.g., 15 - 9, 7 + 8; Henry & Brown, 2008; LeFevre, Sadesky,
& Bisanz, 1996), the Change3More variable was coded such
that “0” meant that the item required a change of one or two
number line spaces (e.g., 5 = 4 + __, change = 1) and “1”
meant that the item required a change of three or more spaces
(e.g., 6 = 2 + __, change = 4). Overall, 19 of the 28 items
required a change of three or more spaces.

Figure 1. Open Equations measure.

Nonstandard equation type (Nonstandard). Nonstandard
equations may be more difficult for children (McNeil, 2008;
Mickey & McClelland, 2014). For Open Equations, nonstandard
equations included operation-right-side and operation-both-sides
equations. Here, “0” represented a standard equation (n = 10) and
“1” represented a nonstandard equation (n = 18).

Table 3

Open Equation Features by Item

No.  Equation           Change3More  Nonstandard  OperBoth  OppOper  SubOper
1    __ + 3 = 7         1            0            0         1        0
2    2 = 7 - __         0            1            0         0        1
3    __ = 4
4    6 = 2 + __         0            1            0         1        0
5    __ - 4 = 3         1            0            0         1        1
6    3 + 5 = 4 + __     1            1            1         1        0
7    __ = 7 - 4         1            1            0         0        1
8    __ - 6 = 2         0            0            0         1        1
9    9 = __ + 4         1            1            0         1        0
10   8 - 6 = __ - 3     0            1            1         1        1
11   __ - 3 = 8 - 2     1            1            1         1        1
12   5 = __ + 3         1            1            0         1        0
13   5 = 9 - __         1            1            0         0        1
14   3 + __ = 8         1            0            0         1        0
15   5 + 4 = __ + 2     0            1            1         1        0
16   9 - __ = 6         1            0            0         0        1
17   7 + 2 = __         0            0            0         0        0
18   __ + 4 = 5 + 2     1            1            1         1        0
19   7 = __ - 2         0            1            0         1        1
20   7 - __ = 5         1            0            0         0        1
21   5 + __ = 9         1            0            0         1        0
22   3 + __ = 2 + 7     1            1            1         1        0
23   6 = __ - 2         0            1            0         1        1
24   9 - 6 = 7 - __     1            1            1         0        1
25   __ + 6 = 9         1            0            0         1        0
26   7 = __
27   __ = 2 + 6         0            1            0         0        0
28   8 - 3 = __         1            0            0         0        1
29   6 - __ = 7 - 3     1            1            1         0        1
30   7 = 4 + __         1            1            0         1        0

Note. Change3More = change of 3 or more number line spaces for arithmetic; Nonstandard = nonstandard
equation type (i.e., equal sign in atypical position); OperBoth = operation-both-sides equation;
OppOper = to solve equation, opposite operation from operator symbol is required; SubOper = subtraction
operator symbol. Items 3 and 26 are identity statements and were not coded.

Operation-both-sides (OperBoth). Open Equations included 
equations with an operator symbol on the left side of the equation, 
right side of the equation, or equations with an operator symbol on 
both sides of the equation. For this variable, “0” represented an
operation-left-side or operation-right-side equation, and “1” represented
an operation-both-sides equation because operation-both-sides
equations are typically more difficult for children to solve as
interpretation of the equal sign as relational is necessary (McNeil 
et al., 2006). Of the 28 items, 20 involved an operator symbol on 
one side and eight involved operation-both-sides. 

Opposite operation required (OppOper). For some equa- 
tions, the method required for arriving at the correct answer 
involved using the opposite operation of the one shown in the 
problem. For example, 5 + __ = 9 required a subtraction strategy 
(e.g., 9 - 5) to solve, even though an addition symbol was shown.
Children typically have more difficulty with solving problems that 
require the opposite strategy from the operator symbols (Orrantia, 
Rodríguez, Múñez, & Vicente, 2012). For Open Equations, 18
items required the opposite operation. This was coded such that 
“0” referred to an equation that did not require the opposite 
operation and “1” referred to an equation that did. 


Subtraction operator shown (SubOper). For all 28 items 
requiring an operation, the operation shown was either addition or 
subtraction. We included this variable, as many children demon- 
strate better proficiency with addition over subtraction (Baroody, 
1999; Peters et al., 2014). Here, “0” represented an item for which 
the addition operator symbol (i.e., plus sign) was presented and “1” 
represented an item for which the subtraction operator symbol (i.e., 
minus sign) was presented. Addition and subtraction items were 
balanced on the test, with 14 items of each type. 
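
Two of these item features can be read directly off the printed equation. The sketch below is one possible coding rule, inferred from the definitions above rather than taken from the authors' materials. SubOper is 1 when a minus sign is shown. OppOper is 1 when the blank is an addend (e.g., 5 + __ = 9) or a minuend (e.g., __ - 4 = 3), because those blanks are found with the inverse of the printed operation. The function name and the `__` notation for the blank are assumptions.

```python
def code_operation_features(equation: str) -> dict:
    """Return 0/1 codes for SubOper and OppOper for one open equation."""
    left, right = (side.strip() for side in equation.split("="))
    blank_side = left if "_" in left else right

    if "+" in blank_side:
        # Blank is an addend: it is found by subtracting.
        opp_oper = 1
    elif "-" in blank_side:
        # Blank first means it is the minuend (found by adding);
        # blank last means it is the subtrahend (found by subtracting).
        opp_oper = 1 if blank_side.startswith("_") else 0
    else:
        # Blank stands alone on its side: compute the other side as printed.
        opp_oper = 0

    return {"SubOper": int("-" in equation), "OppOper": opp_oper}


print(code_operation_features("5 + __ = 9"))  # {'SubOper': 0, 'OppOper': 1}
print(code_operation_features("9 - __ = 6"))  # {'SubOper': 1, 'OppOper': 0}
```

The rule assumes a single blank and the same operator on both sides of any operation-both-sides equation, as in the Open Equations items.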


Procedure 


For the Grade 1 Spring cohort, Addition Fluency, Subtraction 
Fluency, and Open Equations were administered the second or 
third week of February 2013. Assessment occurred in one 15-min 
whole-class testing session. Examiners were 16 research assistants 
working in pairs. Addition Fluency was administered first, fol- 
lowed by Subtraction Fluency, followed by Open Equations. For 
the Grade 2 Fall cohort, assessment occurred during the third or 
fourth week of October 2012 in one 30-min whole-class testing 
session. Seven examiners administered Addition Fluency, Subtrac- 
tion Fluency, and Open Equations (in that order), followed by two 
other measures. The Grade 2 Spring cohort was tested in one 
40-min whole-class testing session conducted by seven examiners
working in pairs. This session occurred during the second or third
week of April 2012. Five measures were administered during the
session, with the first three measures being Addition Fluency,
Subtraction Fluency, and Open Equations (in that order). Across
all cohorts, the order of the three measures (i.e., Addition Fluency,
Subtraction Fluency, and Open Equations) was the same with all
three measures administered at the beginning of the assessment
session. All examiners were working on, or had already earned, a
bachelor’s, master’s, or doctoral degree in education-related fields.
All examiners were trained to administer the three measures following
the same testing procedures and to read from a testing
script.

Table 4

Open Equation Descriptive Statistics (N = 28)

                                                      Item count   Open Equations accuracy
Variable                                              n (%)        M (SD)
Change of 3 or more number line spaces (Change3More)
  No (change of 1 or 2)                               9 (32)       [illegible]
  Yes                                                 19 (68)      [illegible] (.18)
Nonstandard equation type (Nonstandard)
  No (standard)                                       10 (33)      [illegible] (.15)
  Yes                                                 20 (67)      .22 (.15)
Operation-both-sides (OperBoth)
  No (operation-left-side or operation-right-side)    20 (71)      .34 (.15)
  Yes                                                 8 (29)       .10 (.04)
Opposite operation required (OppOper)
  No (not required)                                   10 (36)      [illegible] (.12)
  Yes                                                 18 (64)      .28 (.19)
Subtraction operation shown (SubOper)
  No (addition)                                       14 (50)      [illegible] (.20)
  Yes                                                 14 (50)      .22 (.12)


Data Analysis 


For analysis, we used a series of cross-classified random effects 
models (Wilson & De Boeck, 2004) with random person and item 
effects. These models allowed us to explore person characteristics 
and abilities, item characteristics, and interactions between person 
and item variables in the same model. In the present study, we can 
understand which characteristics of children predict their perfor- 
mance on Open Equations items, which characteristics of Open 
Equations items affect item difficulty, and whether certain item 
characteristics affect children differently depending on the chil- 
dren’s arithmetic skill and demographics. 

The analysis employs methods that are becoming increas- 
ingly common in psychology (Baayen, Davidson, & Bates, 
2008) because they do not require different by-person and 
by-item analyses. This makes statistical inference simpler and 
person-item interactions easier to explore. They have been 
applied in recent studies of reading behavior (e.g., Gilbert, 
Compton, & Kearns, 2011; Kearns, 2015; Piasta & Wagner, 
2010), but, to our knowledge, this is the first application of this 
approach to data in mathematics. The details of the analytical 
technique have been described elsewhere (see Gilbert et al., 
2011, for an applied explanation), but a brief description is 
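given here. For the binary outcome (correct-incorrect Open
Equations responses), responses are expected to follow the
Bernoulli distribution and a logit link is used. Equation 1 shows
the structure of a simple version of the model using the
multilevel form familiar to users of multilevel modeling:

\[
\begin{aligned}
\text{Level 1 (Responses}_{ji}\text{):} \quad & \operatorname{logit}(p_{ji}) = \lambda_{ji} \\
\text{Level 2 (Person}_{j}\text{ and Item}_{i}\text{):} \quad & \lambda_{ji} = \gamma_{000} + \gamma_{010}\,\mathrm{ChildVar}_{j} + \gamma_{001}\,\mathrm{ItemVar}_{i} + r_{010j} + r_{001i}, \\
& r_{010j} \sim N(0, \sigma_{010}), \qquad r_{001i} \sim N(0, \sigma_{001})
\end{aligned}
\tag{1}
\]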


where $p_{ji}$ is the probability of a correct response by child $j$ on
item $i$, $\lambda_{ji}$ is the logit of the probability of a correct open
equation answer from child $j$ on open equation item $i$, $\gamma_{000}$ is the
intercept representing the mean logit of a correct response, $\gamma_{010}$
is the effect of some ability or demographic variable
(ChildVar$_j$) on child performance, $\gamma_{001}$ is the effect of some item
characteristic (ItemVar$_i$) on item performance, $r_{010j}$ is the
child random effect for child $j$, and $r_{001i}$ is the item random
effect for item $i$. Child random effects describe the variability in
child performance on the Open Equations test, and item random
effects describe the range of difficulties for the open equation
items. The random effects are expected to be normally
distributed, as Equation 1 shows. We also tested whether there were
random effects of classroom (Level 3) and school (Level 4),
with the same normality assumptions (each a zero-mean normal
distribution with its own variance). These are omitted from Equation 1 for
simplicity but are given in Appendix A.
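
The data-generating process that Equation 1 describes can be sketched in a few lines of code. This is an illustration only, not the estimation procedure used in the study: the function name `simulate_responses`, the covariates it invents, and every parameter value in the usage line are hypothetical, and the second argument of each normal distribution is treated as a variance.

```python
import math
import random


def inv_logit(x: float) -> float:
    """Map a logit (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_responses(n_children, n_items, gamma_000, gamma_010, gamma_001,
                       var_child, var_item, seed=0):
    """Simulate binary responses from the two-level cross-classified model.

    All parameter values passed in are illustrative assumptions, not
    estimates from the study.
    """
    rng = random.Random(seed)
    # Hypothetical standardized child covariate and binary item characteristic.
    child_var = [rng.gauss(0.0, 1.0) for _ in range(n_children)]
    item_var = [i % 2 for i in range(n_items)]
    # One zero-mean normal random effect per child and one per item.
    r_child = [rng.gauss(0.0, math.sqrt(var_child)) for _ in range(n_children)]
    r_item = [rng.gauss(0.0, math.sqrt(var_item)) for _ in range(n_items)]

    rows = []
    for j in range(n_children):
        for i in range(n_items):
            # Level 2: linear predictor with crossed child and item effects.
            logit = (gamma_000 + gamma_010 * child_var[j]
                     + gamma_001 * item_var[i] + r_child[j] + r_item[i])
            # Level 1: Bernoulli response with a logit link.
            p = inv_logit(logit)
            rows.append((j, i, p, 1 if rng.random() < p else 0))
    return rows


# Hypothetical usage: 45 children crossed with 28 items.
responses = simulate_responses(45, 28, -1.7, 0.5, -1.0, 1.0, 2.3, seed=1)
```

Because every child answers every item, each response carries both a child effect and an item effect, which is what makes the design cross-classified rather than strictly nested.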

We also examined random slopes, that is, whether random 
effects interacted with fixed effects. For example, we consid- 
ered whether item difficulty varied as a function of the fixed 
effect of retention. Put differently, we considered whether chil- 
dren who were retained might have more difficulty with some 
types of items than others even after we accounted for items’ 
overall difficulty. We tested these random slopes to assure that 
the model provided the best representation of the relations 
present in the data. One noteworthy point is that we did not
permit correlation between the random slopes. Permitting such
correlations increased the estimation time and caused difficulties
with convergence. We had no theoretical rationale for including these
correlations, so they were omitted.


Results 


First, we fit an unconditional model to the data containing only
an intercept and child and item random effects (see Table 5).
Tables 6 and 7 show correlations among child variables and item
variables. We then added classroom and school random effects and
tested whether these improved model fit. Compared with the
unconditional model with person and item random effects, the
model with classroom fit better, $\Delta\chi^2_1 = 580.82$, $p < .001$, as did
the model with school, $\Delta\chi^2_1 = 246.37$, $p < .001$. A model with
person, item, classroom, and school random effects fit better than
that with person, item, and classroom, $\Delta\chi^2_1 = 9.49$, $p = .002$, and
better than that with person, item, and school, $\Delta\chi^2_1 = 343.94$, $p <
.001$. Using the formula recommended by Cho and Rabe-Hesketh
(2011), we calculated the person intraclass correlation (ICC) to be
.29 conditional on items, and calculated the item ICC to be .28
conditional on person. Thus, children showed variation in their
performance on prealgebra items, and the items had different
levels of difficulty. Given the magnitude of the ICCs, there was
sufficient variability to model child and item performance, and
we proceeded to answer the research questions.
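
The p values attached to these model comparisons can be checked from the reported chi-square differences. The sketch below assumes that each comparison adds a single variance parameter, so that the difference is referred to a chi-square distribution with 1 degree of freedom; the function name `lrt_p_value_df1` is hypothetical, and the calculation is a reader's check rather than the authors' procedure.

```python
import math


def lrt_p_value_df1(delta_chi2: float) -> float:
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), so only the standard library
    is needed.
    """
    if delta_chi2 < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(delta_chi2 / 2.0))


# Chi-square differences reported for the random-effects comparisons.
for label, stat in [("add classroom", 580.82), ("add school", 246.37),
                    ("four-level vs. classroom only", 9.49)]:
    print(f"{label}: p = {lrt_p_value_df1(stat):.4g}")
```

Under the 1-df assumption, the statistic of 9.49 corresponds to a p value that rounds to .002, which agrees with the value reported above.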


For the first research question, we examined the main effect of
cohort. For this analysis, binary variables for Grade 1 Spring ($\gamma_{010}$)








Table 5
Model Results

                                                        Cohort            Fluency           Demographics      Item              Child × item
                                                        model             model             model             model             model
Fixed effect                                            Coeff. (SE)       Coeff. (SE)       Coeff. (SE)       Coeff. (SE)       Coeff. (SE)
Intercept                                               −1.684 (?)        −1.868 (.310)     −1.715 (?)        −.919 (.744)      [illegible]
Child
  γ010 Grade 1 Spring                                   −.974 (?)***      [illegible]       [illegible]       −.158 (.149)      −.155 (.149)
  γ020 Grade 2 Spring                                   1.302 (?)***      [illegible]       .808 (.164)***    .951 (.163)       .950 (.163)
  γ030 Addition Fluency                                                   .111 (.010)***    .104 (.010)***    [illegible]       .126 (.015)***
  γ040 Subtraction Fluency                                                .114 (?)***       [illegible]       [illegible]       .114 (.016)***
  γ050 Race/Ethnicity-African Am.                                                           −.447 (.096)***   −.419 (.103)      −.418 (.103)
  γ060 Race/Ethnicity-Asian Am.                                                             −.014 (.155)      [illegible]       .160 (.171)
  γ070 Race/Ethnicity-Latino                                                                −.370 (.126)**    [illegible]       [illegible]
  γ080 Race/Ethnicity-Other                                                                 −.112 (.194)      −.070 (.214)      −.070 (.214)
  γ090 Male                                                                                 −.028 (.058)      −.084 (.064)      −.084 (.064)
  γ0100 Special education                                                                   .004 (.147)       .010 (.169)       .010 (.168)
  γ0110 Retained                                                                            −.418 (?)*        −.488 (.193)*     −.486 (.193)*
  γ0120 English learner                                                                     .120 (.124)       [illegible]       [illegible]
Item
  γ001 Change of 3 or more spaces                                                                             .428 (.489)       .432 (.488)
  γ002 Operation-both-sides                                                                                   −2.631 (.568)***  [illegible]
  γ003 Nonstandard equation type                                                                              [illegible]       [illegible]
  γ004 Opposite operation required                                                                            .188 (.514)       .182 (.514)
  γ005 Subtraction operation shown                                                                            −.808 (.490)      −.791 (.487)
Child × Item interactions
  γ033 Addition Fluency × Nonstandard Equation Type                                                                             .009 (.016)
  γ035 Addition Fluency × Subtraction Operation                                                                                 −.033 (.014)*
  γ041 Subtraction Fluency × Change of 3 or More Spaces                                                                         .025 (.015)

Random effect                                           Variance          Variance          Variance          Variance          Variance
  r010 Child                                            2.025             1.071             1.032             .986              .985
  r013 Nonstandard equation type                                                                              1.694             1.698
  r014 Opposite operation required                                                                            [illegible]       [illegible]
  r001 Item                                             2.282             2.482             2.538             [illegible]       1.314
  r011 Grade 1 Spring                                   .133              .120              .147              .114              [illegible]
  r021 Grade 2 Spring                                   .215              .150              .120              .146              .147
  r031 Addition Fluency                                                   .001              .001              .001              .001
  r041 Subtraction Fluency                                                .001              .001              .001              .001
  r051 Race/Ethnicity-African Am.                                                           .036              .015              .015
  r0101 Special education                                                                   .114              .139              [illegible]
  r0111 Retained                                                                            .059              .085              .085
  Classroom                                             .307              .181              .187              .170              .169
  School                                                .156              .039              .075              .022              .022

Note. Coeff. = coefficient; SE = standard error. [illegible] = entry not legible in the source; (?) = standard error not legible in the source. Significance markers are shown only where legible in the source or reported in the text; their absence elsewhere does not imply nonsignificance.
*p < .05. **p < .01. ***p < .001.


Table 6
Correlations Among Child Variables 
Variable 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
1. Open Equations overall score — 
2. Grade 1 Spring — 43 tJ 
3. Grade 2 Fall 00 jS5: — 
4. Grade 2 Spring Ae ORE OS — 
5. Addition Fluency score iO) 248 Z 41 —_ 
6. Subtraction Fluency score (69 See 09 36 fll] — 
7. Gender .07 .00 023 — 102 wld at — 
8. Race/Ethnicity-African American —.14 065 4112 (05g —.16ei Aaa 08 =- 
9. Race/Ethnicity-Asian American 068 F-02 03 00 08 Ope — 
10. Race/Ethnicity-Latino —.14 .08 01 09 aS 14 02 .15 07 _— 
11. Race/Ethnicity-Other (O22 ee OS .05 02 02 2 erat) Suen) — 
12. Race/Ethnicity-Caucasian rite — 09 .09 .00 19 ay, O4e 65 Ot eee — 
13. Special education status —.09 027 06 04 = 1SeeS2 .09 LOOT 049303 04. —.06 = 
14. Retention status Saale O8pessecOSepese 04a —.1 Sipe lO .06 OS yeaa 02, O2es 1201 5:07 17 — 
15. English learner status 2017, 1 Sos Os SSO SOD ea Sal) 17 58 OD eS ae 2 OF 
and Grade 2 Spring ($\gamma_{020}$) were included, with Grade 2 Fall as the
reference. The best-fitting model included random item slopes for
Grade 1 Spring and Grade 2 Spring ($r_{011}$ and $r_{021}$, respectively),
$\Delta\chi^2 = 334.34$, $p < .001$. These random slopes suggested that the
effect of cohort on a child's likelihood of a correct response on a
prealgebra item varied across items. For Grade 1 Spring, the
correlations between cohort and item-specific prealgebra
performance ranged from −.10 (Item 3) to −.34 (Item 14). For
Grade 2 Spring, the smallest magnitude was for Item 8 (r = .12) and the
largest was for Items 25 and 28 (rs = .39).

In terms of fixed effects, the intercept, −1.68, indicated
that a child in the Grade 2 Fall cohort had a probability of a correct
response of .16 for an item of average difficulty. The Grade 1
Spring effect was significant, $\gamma_{010} = -0.974$, $\Delta\chi^2_1 = 24.95$, $p <
.001$, suggesting that a child in the Grade 1 Spring cohort had a
mean probability of a correct response of .07 for an item of average
difficulty. By contrast, the significant Grade 2 Spring effect,
$\gamma_{020} = 1.302$, $\Delta\chi^2_1 = 35.14$, $p < .001$, indicated that a child in the
Grade 2 Spring cohort had a mean probability of a correct response
of .41 for an item of average difficulty.

For the second research question, we examined whether
children's Addition Fluency ($\gamma_{030}$) and Subtraction Fluency ($\gamma_{040}$)
would affect their probability of a correct prealgebra response. The
model with random addition and subtraction fluency slopes ($r_{031}$
and $r_{041}$, respectively) fit better than one without them, $\Delta\chi^2 =
401.22$, $p < .001$, suggesting that the effect of arithmetic fluency
varied across items. Appendix B provides the by-item correlations
between fluency and Open Equations performance for each item
and illustrates the degree of variability across items. The intercept
estimate, −1.87, indicated that a child with mean addition
and subtraction fluency in the Grade 2 Fall cohort would have a .13
probability of a correct response on an average item. The effect of
addition fluency, $\gamma_{030} = 0.111$, was significant, $\Delta\chi^2_1 = 62.92$, $p <
.001$, and indicated that a child in the Grade 2 Fall cohort with addition
performance one standard deviation above the mean would have a .25
probability of a correct response versus .06 for an otherwise-average
child with addition fluency one standard deviation below the mean.
There was also a significant subtraction effect, $\gamma_{040} = 0.114$,
$\Delta\chi^2_1 = 62.42$, $p < .001$, with comparable effects on the
probability of a correct response (1 SD above = .21; 1 SD below =
.08). It is noteworthy that the addition and subtraction arithmetic
effects are similar in magnitude, particularly given that their correlation is
quite high (r = .77). The Grade 2 Spring cohort effect was also
significant, whereas the Grade 1 Spring effect was not.


Table 7
Correlations Among Item Variables

Variable                                                     1             2       3       4       5
1. Open Equation response                                    —
2. Change of 3 or more number line spaces (Change3More)      .04           —
3. Nonstandard equation type (Nonstandard)                   [illegible]   −.19    —
4. Operation-both-sides (OperBoth)                           −.24          .10     .47     —
5. Opposite operation required (OppOper)                     .01           −.03    .14     .07     —
6. Subtraction operation shown (SubOper)                     [illegible]   −.08    .00     .00     −.45

Note. [illegible] = value not legible in the source.


We proposed three additional research questions about child- and
item-level differences. Although we considered demographics last,
we ran a model containing only demographic variables before
adding item and interaction effects. This was important to establish
a base against which we could compare the interaction model.
Table 5 shows the order in which the models were constructed. We
answer our fifth research question about demographics with the
results from the model without the interactions, which is the model
in the middle column of Table 5.

For the demographics model, we examined whether children's
performance on Open Equations varied as a function of children's
demographic characteristics. We tested random item slopes for
each demographic category and found variability for the African
American ($r_{051} = 0.036$), special education ($r_{0101} = 0.114$), and
retention ($r_{0111} = 0.059$) parameters. These random effects
suggest that the variability in item responses was greater for these
groups than the sample as a whole. For African American children,
for example, the probability of a correct response varied more
across items than for other racial groups, in which the effect of race
on the probability of a correct response was similar for all items
(see Appendix B for the by-item correlations). The model with
these three additional random slopes fit the data better than the
model without them, $\Delta\chi^2 = 64.412$, $p < .001$.
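
One way to read a random-slope variance is as a plausible range of item-specific effects around the corresponding fixed effect. The sketch below is an illustration rather than part of the reported analysis. It assumes that item-specific slopes are normally distributed around the fixed effect, in line with the normality assumption stated for the random effects, and the helper name `plausible_range` is hypothetical. The inputs pair each demographics-model fixed effect with its random-slope variance as reported in the text and Table 5.

```python
import math


def plausible_range(fixed_effect: float, slope_variance: float, z: float = 1.96):
    """Return (low, high) bounds covering about 95% of item-specific effects.

    Assumes item-specific slopes are normally distributed around the fixed
    effect with the given variance.
    """
    sd = math.sqrt(slope_variance)
    return fixed_effect - z * sd, fixed_effect + z * sd


# (fixed effect in logits, random item-slope variance), demographics model.
effects = {
    "African American": (-0.447, 0.036),
    "Special education": (0.004, 0.114),
    "Retained": (-0.418, 0.059),
}
for name, (coef, variance) in effects.items():
    low, high = plausible_range(coef, variance)
    print(f"{name}: {low:.3f} to {high:.3f} logits")
```

A wide range relative to the fixed effect indicates that the group difference is far from uniform across items.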

For the fixed effects, the intercept, −1.715, indicated that
the mean probability of a correct response was .15. This was the
probability for a child in the Grade 2 Fall cohort who was female,
Caucasian, not receiving special education services, not previously
retained, and not given an EL designation. For all subsequent
models, a child with these characteristics is the reference because
this group was the largest except for the equivalent Caucasian male
group (31% of the sample). The intercept also represents the
probability for a child with average arithmetic fluency on a
prealgebra item of average difficulty ($r_{001i} = 0$). The effects of being
in the Grade 2 Spring cohort and of addition and subtraction
fluency remained significant, as they had been in the previous
model. They were also quite similar in magnitude. For the
demographic variables, there were significant fixed effects for the
African American and Latino race/ethnicity categories, such that
these children had significantly lower performance than their
Caucasian peers, $\gamma_{050} = -0.447$, $\Delta\chi^2_1 = 19.03$, $p < .001$, and
$\gamma_{070} = -0.370$, $\Delta\chi^2_1 = 8.52$, $p = .004$, both reflecting a mean
probability of a correct Open Equations response of .11, for a
female of either race who was not receiving special education
services, who had not been retained, and who did not have an EL
designation, compared with .15 for an otherwise identical
Caucasian female. The other significant fixed effect was that for
retention status, $\gamma_{0110} = -0.418$, $\Delta\chi^2_1 = 5.87$, $p = .02$, suggesting a
probability of a correct response of .10 for a Caucasian female
child who had been retained, versus .15 without previous retention.
There were no detectable differences between the Open Equations 
performance of children in the Asian American and Other catego- 
ries and the performance of Caucasian children. We also failed to 
detect differences between male and female children, children with 
and without a special education designation, and children having 
and not having an EL designation. 

For the item-level model, we considered the effects of five 
variables thought to make Open Equations items more difficult: 
Change3More, Nonstandard, OperBoth, OppOper, and SubOper. 
Random child slopes for all five item characteristics, with intercept
correlations, improved model fit, but the model including all five slopes
did not converge because of the large number of effects to
estimate. As a result, we selected random effects based on their
theoretical importance and reestimated the model. We theorized 
that child variability might be greater for nonstandard equations 
and equations requiring the opposite operation than for other item 
characteristics, on the basis that some children might have ac- 
quired sophisticated understandings of the prealgebraic concepts 
required to do these problems accurately, whereas others remained 
quite unaware. For the model with these two theoretically selected 
random effects, the child fixed effects were very similar to those in 
Model 3, and one item effect was significant: that of OperBoth.
The magnitude of the effect, $\gamma_{002} = -2.631$, $\Delta\chi^2_1 = 15.79$, $p <
.001$, meant that the mean child's likelihood of a correct response on


an Open Equations item with an operation-both-sides was just .03 
compared with .29 for a problem that did not have operation-both- 
sides. Both probabilities are given with the assumption the problem 
did not involve Change3More, Nonstandard, OppOper, or SubOper, 
and given the same child characteristics used throughout. 
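
The probabilities reported throughout this section follow from applying the inverse logit to a sum of coefficients. As a minimal sketch, the function below (the name `predicted_probability` is hypothetical) reproduces the operation-both-sides comparison from the item-model intercept in Table 5 (−0.919) and the OperBoth effect (−2.631), with all other item characteristics set to zero.

```python
import math


def predicted_probability(intercept: float, *effects: float) -> float:
    """Inverse logit of the intercept plus any fixed effects that apply."""
    logit = intercept + sum(effects)
    return 1.0 / (1.0 + math.exp(-logit))


# Item-model estimates as reported in the text and Table 5.
INTERCEPT = -0.919
OPER_BOTH = -2.631

p_without = predicted_probability(INTERCEPT)
p_with = predicted_probability(INTERCEPT, OPER_BOTH)
print(f"no operation-both-sides: {p_without:.2f}")
print(f"operation-both-sides:    {p_with:.2f}")
```

The same function applies to any other reference-child calculation in the section: pass the relevant model's intercept followed by whichever coefficients are switched on.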

The final model involved child × item interactions. The three
selected interactions were necessarily exploratory, given that no
prior study has examined prealgebraic performance in this way.
The first interaction investigated Addition Fluency × Nonstandard.
We tested this interaction because children with stronger
addition skill in early elementary school demonstrate higher math- 
ematics achievement in later elementary school (Geary, 2011). As 
nonstandard equations are more difficult than standard equations 
(Weaver, 1973), we wanted to investigate the influence of arith- 
metic with addition on complex prealgebraic items. We developed 
the second interaction (Addition Fluency × SubOper) because of
research indicating that addition is easier for children than subtraction
(Peters et al., 2014). As children experience difficulty with the 
symbols of mathematics (Driver & Powell, 2015), we wanted to 
investigate whether improved addition arithmetic was related to an 
interpretation of the symbols of addition and subtraction (i.e., 
prealgebra). Our third interaction (Subtraction Fluency ×
Change3More) was generated from research indicating that sub- 
traction is more difficult for children than addition (Peters et al., 
2014). Therefore, we hypothesized that children with stronger 
subtraction skill would demonstrate improved performance on 
equations that require more complex arithmetic. 

One of the three interactions was statistically significant (Addition
Fluency × SubOper; $\gamma_{035} = -0.033$, $\Delta\chi^2_1 = 4.79$, $p < .03$).
For children with average Addition Fluency scores, the probability 
of a correct Open Equations response on an addition problem (at 
the intercept) was .28, compared with .15 for a subtraction prob- 
lem. For children with addition scores one standard deviation 
above the mean, the probability of a correct Open Equations 
response for an addition problem was .49, versus .26 for a
subtraction problem. For children with addition scores one
standard deviation below the mean, the probability of a correct 
Open Equations response for addition was .14, versus .09 for 
subtraction. The interaction indicates that the positive effect of 
addition fluency skill on the probability of a correct response is 
stronger for prealgebraic problems showing an addition operation 
than those showing a subtraction operation. Children with poor 
addition fluency show weak performance regardless of the prob- 
lem type, whereas children with strong addition fluency show 
better performance for problems showing addition than problems 
showing subtraction. Figure 2 shows the interaction. 
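
The interaction can also be expressed as two simple slopes on the logit scale. The sketch below is an illustration with a hypothetical helper, `simple_slope`. It uses the Addition Fluency coefficient (.126) and the interaction coefficient (−.033) from the child × item model in Table 5, and it assumes standard-form items so that the Addition Fluency × Nonstandard term drops out.

```python
def simple_slope(main_effect: float, interaction: float, moderator: int) -> float:
    """Logit change per one-point gain in the predictor at a moderator value."""
    return main_effect + interaction * moderator


# Child x item model estimates as reported in Table 5 and the text.
ADDITION_FLUENCY = 0.126   # Addition Fluency effect when SubOper = 0
ADD_BY_SUBOPER = -0.033    # Addition Fluency x SubOper interaction

# SubOper = 0: item shows an addition operation.
slope_addition_items = simple_slope(ADDITION_FLUENCY, ADD_BY_SUBOPER, 0)
# SubOper = 1: item shows a subtraction operation.
slope_subtraction_items = simple_slope(ADDITION_FLUENCY, ADD_BY_SUBOPER, 1)

print(f"addition items:    {slope_addition_items:.3f} logits per point")
print(f"subtraction items: {slope_subtraction_items:.3f} logits per point")
```

Each additional Addition Fluency point therefore raises the logit of a correct response by about .126 on addition items but only about .093 on subtraction items, which is why the two lines in Figure 2 diverge as fluency increases.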


Discussion 


The prealgebraic expectations for elementary children have
increased over the last few decades, which means that children must
establish strong skill in arithmetic to be prepared for prealgebra
and, subsequently, algebra (Pillay et al., 1998). Prior work at first
and second grade either did not include all nonstandard equation
types (Weaver, 1973) or did not measure differences between
grade levels (de Corte & Verschaffel, 1981). In addition, this prior
research was conducted several decades ago, and the mathematics
expectations for children have changed over the years. To
understand the current development of prealgebraic reasoning across
first and second grade and the connection between arithmetic and
prealgebra, we asked five research questions.

Figure 2. Interaction model. The effect of addition fluency performance (Addition Fluency score) on the
probability of a correct response for problems showing an addition operation (SubOper = 0; solid line) and
problems showing a subtraction operation (SubOper = 1). The interaction indicates that the effect of better addition
performance on the probability of a correct response was smaller for problems showing a subtraction operation
than those showing an addition operation. OE = Open Equations.

With our first research question, we explored whether prealgebraic
reasoning differed among the three cohorts. On Open Equations,
Grade 2 Spring children had the highest probability of a correct
response, followed by Grade 2 Fall children. Grade 1 Spring
children had the lowest probability of a correct response.
Results concerning cohort differences on Open Equations indicate 
that prealgebraic reasoning is a rapidly developing skill during the 
early elementary grades. On average, children gain 4 points from 
Grade 1 Spring to Grade 2 Fall, and then another 4 points from 
Grade 2 Fall to Grade 2 Spring. This rapid change is likely due to 
several reasons. First, early elementary children are learning about 
arithmetic starting in late kindergarten. As equation solving re- 
quires an understanding of arithmetic, daily exposure to addition 
and subtraction may increase prealgebraic reasoning. Additionally, 
teachers may be providing explicit instruction about the equal sign 
as a balance from first through second grade. In previous research, 
the lack of explicit instruction about the equal sign as a relational 
symbol led children to solve equations, especially nonstandard
equations, incorrectly (Powell & Fuchs, 2010; Sherman & Bisanz, 
2009). As teachers provide instruction related to the equal sign, 
nonstandard equation solving typically improves (Powell, Driver, 
et al., 2015). Interestingly, the probability of a correct response on 
Open Equations at Grade 1 Spring is relatively low (.07) for an
item of average difficulty. As mathematics standards expect that 


first-grade children can understand the equal sign as a relational 
symbol and solve equations with the unknown in various positions 
(i.e., reason prealgebraically), our results indicate that most first- 
grade children are underprepared for such a prealgebraic task. 
Even children nearing the end of second grade (i.e., Grade 2
Spring) answer fewer than half of the equations correctly, which
suggests that curricula should devote more attention to
prealgebraic reasoning than they currently do.

With our second research question, we investigated whether 
arithmetic fluency predicted prealgebraic reasoning. We hypothe- 
sized that children with strong arithmetic fluency would solve 
more equations successfully, indicating a connection between 
arithmetic and prealgebraic reasoning (Pillay et al., 1998). Results 
indicate that the influence of arithmetic varied across the items on 
Open Equations. For Grade 2 Fall children, addition fluency and 
subtraction fluency were significant predictors of prealgebraic 
performance, with addition and subtraction effects similar in mag- 
nitude. Similarly, addition fluency and subtraction fluency were 
significant predictors for Grade 2 Spring children. For Grade 1 
Spring, however, our hypothesis was not corroborated: neither
arithmetic measure was a significant predictor.

The results from our second research question indicate a 
correlation between arithmetic and mathematical equation solv- 
ing (i.e., prealgebra) for second-grade children. That is, second- 
grade children with stronger arithmetic fluency score higher on 
Open Equations. This confirms the work of Carraher et al.
(2006), Kieran (2004), and Pillay et al. (1998), in which
arithmetic is described as essential in the arithmetic to prealgebra to
algebra sequence and necessary for competence with algebra 
(Boulton-Lewis et al., 2000). Solving mathematical equations, 
such as those presented on Open Equations, involves a combi- 
nation of understanding arithmetical operations (e.g., addition, 
subtraction) and applying prealgebraic reasoning. If second- 
grade children have stronger arithmetic skill, our result indi- 
cates that children use this arithmetic knowledge for prealgebra. 

The result for the influence of arithmetic on prealgebra, however, 
was not significant at first grade. The scores of Addition Fluency 
(M = 7.59) and Subtraction Fluency (M = 3.84) for Grade 1 Spring 
are significantly lower than Addition Fluency and Subtraction
Fluency scores for Grade 2 Fall or Grade 2 Spring, so the nonsignificant
predictors at first grade could reflect emerging arithmetic skill.
With an assumption that arithmetic fluency eventually predicts pre- 
algebraic performance, these results indicate that more exposure to, 
and practice working with, arithmetic may increase how children 
work with prealgebraic equations. We caution, however, that merely 
knowing arithmetic facts may not be enough to influence equation- 
solving performance, as purported by the hypotheses of Filloy and 
Rojano (1989) and Herscovics and Linchevski (1994). Solving pre- 
algebraic equations involves interpretation of the operator symbols, 
the equal sign, and an understanding of whether addition or subtrac- 
tion is necessary to balance two sides of an equation, and this aligns 
with Pillay et al.’s (1998) definition of prealgebra. 

With our demographics model, which was exploratory, we 
investigated which demographic variables explain variation in 
equation-solving performance after accounting for arithmetic flu- 
ency. Unlike previous research indicating mathematics perfor- 
mance differences between females and males (Penner & Paret, 
2008; Royer, Tronsky, Chan, Jackson, & Marchant, 1999), we did 
not find significant differences based on gender. Similar to gender, 
special education status and EL status were not significant predic- 
tors of prealgebra performance. 

We found significant differences based on race/ethnicity
and retained status. In terms of race/ethnicity, African American 
children and Latino children demonstrated lower prealgebraic per- 
formance. There were not significant differences between Asian 
American and Caucasian children. These results about race/eth- 
nicity mirror National Assessment of Educational Progress data 
for fourth and eighth grade (National Center for Education Statis- 
tics, 2014). That performance differences emerge based on race/ 
ethnicity status during first and second grade indicates that various 
factors (e.g., home mathematics exposure, preschool mathematics 
experiences, and kindergarten mathematics experiences) contrib- 
ute to differential mathematics performance for children as early as 
6 years old. With these differences emerging early in a child’s 
schooling, it may be important to provide targeted mathematics 
intervention to children who exhibit difficulty with prealgebraic 
concepts. These results about differences based on demographics 
should be interpreted with caution because we did find significant 
differences based on race/ethnicity among cohorts; the Grade 2 
Fall cohort had approximately 10% fewer African American chil- 
dren than the other two cohorts. 

For the 4% of children in our sample who had been retained, 
prealgebra performance was significantly lower than that of children
who had not been retained. As the children in our sample were
retained in kindergarten or first grade, these results substantiate re- 


search for kindergarten and first grade about retention. For example, 
the research of Burkam, LoGerfo, Ready, and Lee (2007) indicates 
that kindergarten children who have been retained do not receive any 
notable benefit from the retention and continue to perform below 
peers on mathematics measures. Similarly, Willson and Hughes 
(2009) indicate that first-grade children who have been retained dem- 
onstrate lower mathematics performance than promoted peers. The 
results related to retained status also suggest that teachers may need to 
provide additional or differentiated intervention for children about 
prealgebraic concepts and procedures. When performance differences 
based on demographic characteristics are noted in the later grades 
(i.e., fourth grade, eighth grade), it is important to realize that warning 
signs about lower prealgebraic understanding emerge early. 

We investigated the characteristics of prealgebraic items, which 
may relate to prealgebraic performance, with our item-level model. 
Based on prior research, we characterized items (a) with a change of 
three or more number line spaces (Henry & Brown, 2008), (b) in 
nonstandard form (Powell, Driver, et al., 2015), (c) with operator 
symbols on both sides (McNeil et al., 2006), (d) that required the 
opposite operation of the operator symbol (Orrantia et al., 2012), and 
(e) that used a subtraction operation (Peters et al., 2014). We found that
the item characteristic related to operation-both-sides equations was 
significant. The mean likelihood of a correct response on an item with 
operation-both-sides was .03 compared with .29 for equations with
operation-right-side or operation-left-side. The majority of research 
about equivalence has used operation-both-sides equations, and re- 
searchers state these equations are the best type of equation to use 
when evaluating a child’s understanding of the relational nature of the 
equal sign. That first- and second-grade children have difficulty with 
operation-both-sides equations is not surprising. It is likely that chil- 
dren have not received exposure to many operation-both-sides equa- 
tions, given the overwhelming number of instances of operation-left- 
side and, to a lesser extent, operation-right-side standard equations in 
elementary mathematics materials (Powell, 2012). Additionally, chil- 
dren likely receive little instruction on symbols, such as the equal sign, 
and the meaning associated with such symbols (Powell, 2012). This 
result indicates that the majority of children do not interpret the equal 
sign as a balance between the two sides of an equation, which is 
necessary for prealgebraic reasoning. Interestingly, children as young 
as 6 and 7 years old do understand the concept of balance and 
sameness, and can demonstrate skill with four-value equations when 
the equations are presented in another format (e.g., concrete manipu- 
latives or pictorial representations; Driver & Powell, 2015; Sherman 
& Bisanz, 2009). 

With our final model, we were interested in interactions between 
child-level data and Open Equations item-level data. Of the three 
interactions tested, only one was significant. The significant interac- 
tion demonstrated that a child’s Addition Fluency score had a differ- 
ential effect on prealgebraic equations with an addition operator 
symbol versus a subtraction operator symbol. That is, children with an 
average score on Addition Fluency had a greater probability of cor- 
rectly answering equations with an addition operator symbol than a 
subtraction operator symbol. Interestingly, after this interaction dem- 
onstrated significance, we ran a fourth interaction with Subtraction 
Fluency × SubOper. This interaction was not significant, and these
results are not terribly surprising, given that children acquire an 
understanding of addition before subtraction (Gilmore, McCarthy, & 
Spelke, 2007) and demonstrate better proficiency with addition over 
subtraction in the early elementary grades (Canobi, 2004). For children
with Addition Fluency performance one standard deviation
above the mean, the difference was steeper than for children with 
Addition Fluency performance one standard deviation below the 
mean. These results demonstrate (a) how helpful it is for children to 
have strong skill with addition (i.e., arithmetic), and (b) how the 
addition operator symbol is easier for children to manipulate than the 
subtraction operator symbol. 
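
The shape of this interaction can be illustrated with a minimal sketch on the log-odds scale. The coefficients below are hypothetical values chosen only to mimic the direction of the reported effect; they are not the estimates from the study, and the function names are ours.

```python
import math

# Hypothetical log-odds coefficients (NOT the study's estimates).
INTERCEPT = -0.5      # addition-operator item, average Addition Fluency
B_FLUENCY = 0.8       # effect of Addition Fluency, per standard deviation
B_SUB_OPER = -0.4     # penalty for a subtraction operator symbol
B_INTERACTION = -0.3  # Addition Fluency x subtraction operator


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def p_correct(fluency_z: float, sub_oper: int) -> float:
    """Probability of a correct answer for a child with a standardized
    Addition Fluency score on an addition (0) or subtraction (1) item."""
    eta = (INTERCEPT
           + B_FLUENCY * fluency_z
           + B_SUB_OPER * sub_oper
           + B_INTERACTION * fluency_z * sub_oper)
    return inv_logit(eta)


def log_odds_gap(fluency_z: float) -> float:
    """Addition-minus-subtraction difference in log-odds."""
    return -(B_SUB_OPER + B_INTERACTION * fluency_z)


for z in (-1.0, 0.0, 1.0):
    print(z, round(p_correct(z, 0), 2), round(p_correct(z, 1), 2),
          round(log_odds_gap(z), 2))
```

With a negative interaction term, the addition advantage in log-odds grows as Addition Fluency increases, which is the pattern described above for children one standard deviation above versus below the mean.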

Before we conclude, we note our study’s limitations. First, we did 
not include Grade 1 Fall children in the study. Given that children are
beginning to work with symbols and learn about the solving of 
equations in first grade, we determined it would be unfair to ask Grade 
1 Fall children to solve the Addition Fluency and Subtraction Fluency 
items. Based on prior experience, we also realized that Grade 1 Fall 
children would be overwhelmed by the prealgebraic reasoning task. 
We understood Open Equations might be difficult for Grade 1 Spring 
children. However, with 6 months of prior addition and subtraction 
instruction, we had fewer concerns about asking Grade 1 Spring 
children to work on our measures, similar to Weaver’s (1973) study 
and the research of de Corte and Verschaffel (1981). Second, we did 
not gather any information about child mathematics anxiety or self- 
efficacy, mentioned by Cheema and Galluzzo (2013) as important 
considerations for mathematics performance. We also did not gather 
reading, language, or working memory data, which researchers have 
linked to mathematics performance (Fuchs et al., 2006; Swanson, 
Lussier, & Orosco, 2015). Future iterations of this research should 
broaden the data gathered about the children. Third, we administered 
Addition Fluency, Subtraction Fluency, and Open Equations under 
timed conditions. As we assessed arithmetic fluency, rather than 
accuracy, with the arithmetic measures, the timed administration is 
appropriate. With Open Equations, which requires prealgebraic rea- 
soning, it may be better to administer the measure in an untimed 
condition. In an analysis of child responses, however, we noted that 
26 of the 30 Open Equations items had response rates greater than 
50%. This indicates that the majority of children had the opportunity
to answer the majority of prealgebraic items in the 8-min time limit.
With an untimed administration on all measures, we could measure 
accuracy rather than fluency. 

Our definition of prealgebra may be another limitation of this 
study. We utilized a definition of prealgebra provided by Pillay et al. 
(1989), in which children work with one variable and there is a focus 
on the relational meaning of the equal sign for balancing sides of an 
equation. On Open Equations, children solved equations with one 
variable, and a relational interpretation of the equal sign was neces- 
sary to solve the majority of equations; therefore, the Open Equations 
measure fulfills criteria for an assessment of prealgebra. Although 
prealgebra research and pedagogy during the early elementary grades 
often utilize mathematical equations similar to those on Open Equa-
tions (e.g., Jacobs et al., 2007; McNeil, 2007; Stephens, Blanton,
Knuth, Isler, & Gardiner, 2015), there may be other avenues for 
assessment of prealgebraic knowledge. For example, researchers use 
function tables and functions for demonstration and assessment of 
early algebraic knowledge (Carraher et al., 2006; Powell & Fuchs, 
2014; Stephens et al., 2015; Warren, Cooper, & Lamb, 2006). Func- 
tions do not necessarily require a relational understanding of the equal 
sign, which does not fit with the prealgebra definition provided by 
Pillay et al. (1989) and others (e.g., Kieran, 2004; Kilpatrick et al., 
2001), but understanding functions may be an important component 
of the arithmetic to prealgebra to algebra framework. Our definition of 
prealgebra may be narrow, and future research should investigate
whether a broader definition of prealgebra provides more meaningful
information about the prealgebraic knowledge of children in the 
elementary grades. 

In sum, our results indicate that prealgebraic growth is linear. As 
children receive more exposure and practice with arithmetic, prealgebraic 
performance improves. This corroborates Pillay et al.’s (1989) model, in 
which arithmetic skill is foundational to prealgebraic reasoning. In terms 
of prealgebraic performance, children with stronger arithmetic skill
solve more open equations correctly, and we learned that children with 
better arithmetic fluency in addition demonstrate significantly better per- 
formance on Open Equations with the addition operator symbol; the same 
is not true for fluency with subtraction. As noted by de Corte and 
Verschaffel (1981), arithmetic operation may contribute to differences in 
performance, and our study substantiates this claim. When focusing on 
the properties of prealgebraic items, we learned that operation-both-sides 
equations were significantly more difficult than operation-right-side or 
operation-left-side equations, which also corroborates the dimensions of 
difficulty outlined by de Corte and Verschaffel. This result is important, 
especially in light of research with later elementary children on the 
difficulty of solving operation-both-sides equations (McNeil, 2007; 
Rittle-Johnson & Alibali, 1999). Early elementary children are expected 
to reason prealgebraically as early as kindergarten, and our results dem- 
onstrate variability by age level, skill with arithmetic, prealgebraic item 
characteristics, and certain demographics. In order to prepare children for 
the rigors of algebra in late elementary, middle, and high school, strong 
prealgebraic skills are necessary. By strengthening arithmetic understand- 
ing and skill, and by providing targeted instruction on specific prealge- 
braic characteristics, prealgebraic performance is likely to improve. 
Appendix A
Equations for Child and Item Model (Model 4) 
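Level 1 (Response_{jikm})

\[
\operatorname{logit}(p_{jikm}) = \eta_{0jikm}
\]

Level 2 (Person_{jkm} & Item_{i})

\[
\begin{aligned}
\eta_{0jikm} ={}& \gamma_{000km}
 + (\gamma_{010km} + r_{011j})\,\mathrm{Grade1Spring}_{j}
 + (\gamma_{020km} + r_{021j})\,\mathrm{Grade2Spring}_{j}
 + (\gamma_{030km} + r_{031j})\,\mathrm{AddTotal}_{j}
 + (\gamma_{040km} + r_{041j})\,\mathrm{SubTotal}_{j} \\
&+ (\gamma_{050km} + r_{051j})\,\mathrm{RaceAfAm}_{j}
 + \gamma_{060km}\,\mathrm{RaceAsianAm}_{j}
 + \gamma_{070km}\,\mathrm{RaceLatino}_{j}
 + \gamma_{080km}\,\mathrm{RaceOther}_{j}
 + (\gamma_{0100km} + r_{0101j})\,\mathrm{SpEd}_{j} \\
&+ (\gamma_{0110km} + r_{0111j})\,\mathrm{Retained}_{j}
 + \gamma_{0120km}\,\mathrm{ELL}_{j}
 + \gamma_{091}\,\mathrm{Change3More}_{i}
 + \gamma_{092}\,\mathrm{Equation4}_{i}
 + (\gamma_{093} + r_{013i})\,\mathrm{Nonstandard}_{i} \\
&+ (\gamma_{094} + r_{014i})\,\mathrm{OppOper}_{i}
 + \gamma_{095}\,\mathrm{OperSub}_{i}
 + r_{010i} + r_{001j}
\end{aligned}
\]

Person random effects:

\[
\left(r_{001j},\, r_{011j},\, r_{021j},\, r_{031j},\, r_{041j},\, r_{051j},\, r_{0101j},\, r_{0111j}\right)^{\top}
\sim MN(\mathbf{0}, \mathbf{T}_{\mathrm{person}}),
\]

where the diagonal of \(\mathbf{T}_{\mathrm{person}}\) holds the variances \(\tau^{2}_{r001}, \tau^{2}_{r011}, \tau^{2}_{r021}, \tau^{2}_{r031}, \tau^{2}_{r041}, \tau^{2}_{r051}, \tau^{2}_{r0101}, \tau^{2}_{r0111}\), and the first row and column hold the covariances of each random slope with the random intercept, \(\tau_{r011,r001}, \tau_{r021,r001}, \tau_{r031,r001}, \tau_{r041,r001}, \tau_{r051,r001}, \tau_{r0101,r001}, \tau_{r0111,r001}\).

Item random effects:

\[
\begin{pmatrix} r_{010i} \\ r_{013i} \\ r_{014i} \end{pmatrix}
\sim MN\!\left(
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\tau^{2}_{r010} & \tau_{r010,r013} & \tau_{r010,r014} \\
\tau_{r013,r010} & \tau^{2}_{r013} & \\
\tau_{r014,r010} & & \tau^{2}_{r014}
\end{pmatrix}
\right)
\]

Level 3 (Classroom_{km})

\[
\gamma_{000km} = \beta_{0000m} + u_{0001k}, \qquad u_{0001k} \sim N(0, \tau^{2}_{u0001})^{a}
\]

Level 4 (School_{m})

\[
\beta_{0000m} = \delta_{0000} + e_{00001m}, \qquad e_{00001m} \sim N(0, \tau^{2}_{e00001})^{a}
\]

^a All child fixed effects are nested within classroom and school in the same way as the intercept. This is not shown to save space.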
Appendix B


Item Accuracies and Correlations 


Items Accuracies and Correlations With Variables With Random Slopes

Item Equation Acc Gr1S Gr2S AddFlu SubFlu AfrAm SpEd Ret
1 StS] ap +139 24 43 34 —.16 eG —.10 
2 =e 38 =29 23 36 36 —.06 —.04 —.04 
3 _=4 37 —.10 14 26 22 —.09 —.01 —.06 
4 C= 2 A8 = 33 OF A2 36 —.11 =a) —.09 
5 C4 32 —.20 19 OT 27 —.06 —.04 —.04 
6 34+5=44+_ 19 —.23 26 39 40 —.04 05 —.05 
7 ape 33 — 30 24 AB Al —.10 —.04 —.09 
8 26 =2 30 —12 12 20 24 = 08 —.03 —.03 
9 9=_+4 51 — 32 26 Al 37 —.09 = 10) 107 
10 86h 3 07 =a5 Oy 25 30 —.04 01 —.03 
11 3 = 8? 09 =16 19 28 33 —.04 = 01 —.04 
12 5= +3 45 — 32 Oy Al 35 —.08 =.08 Ete 
13 = Oe 32 — 30 OF A5 43 —.08 —.04 —.05 
14 8 do 8 54 — 34 29 5D 44 —.13 =10 = 10} 
15 5+ 4.= WO ll —.18 24 35 38 = 02 02 —.05 
16 OS TG 35 —.30 36 51 50 — At —.04 —.10 
17 TD SS. Al —.29 35 A8 44 —.08 —.05 —.07 
18 Ga see 2 12 —.19 26 38 Al —.04 —.04 —.04 
19 ee. a1) —.20 8 38 38 —.03 —.03 —.05 
20 ea wea—s5 29 — <3 38 52 51 —11 —.06 = 11 
21 5+ .=9 31 — 31 36 54 50 =i ==08 —.12 
22 a eee a= tet 7 ll = 20 28 40 43 0 —.04 =05 
23 Gi 13 —.20 29 35 39 —.04 —.02 —.06 
24 C26 09 —.20 28 39 A8 =105 —.03 —.03 
25 _+6=9 25 —.30 39 53 52) — 12 Oi) =O 
26 7 =. m1) —.18 28 38 Al 10 —.04 —.06 
27 = 2.46 18 —27 38 A6 50 —.05 —.04 —08 
28 ata 24 —27 39 il 53 —.10 —.04 —.08 
29 Ge = 3 07 —.20 25 38 AT —.06 —.05 —.05 
30 Tad 15 —.26 36 46 A8 —.07 =105 —.07 


Note. Acc = Accuracy; Gr1S = Grade 1 Spring; Gr2S = Grade 2 Spring; AddFlu = Addition Fluency; SubFlu = Subtraction Fluency; AfrAm = African
American; SpEd = special education; Ret = retained.
Task-Appropriate Visualizations: Can the Very Same Visualization Format
Either Promote or Hinder Learning Depending on the Task Requirements? 


Alexander Soemer 
University of Potsdam 


Stephan Schwan 
Leibniz—Knowledge Media Research Center, 


Tübingen, Germany


In a series of experiments, we tested a recently proposed hypothesis stating that the degree of alignment 
between the form of a mental representation resulting from learning with a particular visualization format 
and the specific requirements of a learning task determines learning performance (task-appropriateness). 
Groups of participants were required to learn the stroke configuration, the stroke order, or the stroke 
directions of a set of Chinese pseudocharacters. For each learning task, participants were divided into 
groups receiving dynamic, static-sequential, or static visualizations. An old/new character recognition 
task was given at test. The results showed that learning both stroke configuration and stroke order was 
best with static pictures (Experiments 1 and 2), while there was no reliable difference between the groups 
for learning stroke direction (Experiment 3). An additional experiment, however, revealed that learning 
with sequential pictures was superior when testing was carried out with sequential pictures, irrespective 
of the learning task (Experiment 4). The combined evidence from all experiments speaks against task 
requirements playing a role in determining the effectiveness of a visualization format. Furthermore, the 
evidence supports the view that a high degree of congruence between information presented during 
learning and information presented at test results in better learning (study-test congruence). Implications
for instructional design are discussed.


Keywords: animations, pictures, task requirements, study-test congruence, Chinese characters 


Educational research has yet to determine the factors underlying 
a successful application of static, static-sequential, and dynamic 
visualizations to situations in which they have the potential to 
support learning instead of merely being a design characteristic. 
On the one hand, a substantial number of empirical reports sug- 
gests that dynamic visualizations may not benefit or might even 
hinder learning compared to static or sequential counterparts (e.g., 
Hegarty, Kriz, & Cate, 2003; Mayer & Chandler, 2001; Mayer, 
Hegarty, Mayer, & Campbell, 2005). However, there are also 
numerous demonstrations of learning with dynamic visualizations 
being superior to learning with static visualizations (e.g., Castro- 
Alonso, Ayres, & Paas, 2015; Imhof, Scheiter, Edelmann & Ger- 
jets, 2011; Van Gog, Paas, Marcus, Ayres, & Sweller, 2009; Wong 
et al., 2009). It is still under debate which factors drive the 
usefulness of sequential and dynamic visualizations in comparison 
to static visualizations (Lowe & Schnotz, 2014), although there are 
both theoretical reasons (Tversky, Morrison, & Bétrancourt, 2002) 
and empirical reasons (Höffler & Leutner, 2007) to believe that
factors related to the specific learning task and the specific learn-
ing content should predict the superiority of a certain visualization
format. 

The current article reports four experiments investigating the 
interplay between task requirements, visualization format during 
learning, and visualization format at test. The focus of the first 
three experiments lies on a recently proposed hypothesis that we 
will label the task-appropriateness hypothesis in the following 
(Lowe, Schnotz, & Rasch, 2011). According to this hypothesis, 
learning performance depends on the alignment between the form 
of a mental representation resulting from learning with a particular 
visualization format and the requirements of a specific learning 
task. An interesting prediction of the task-appropriateness hypoth- 
esis is that a dynamic visualization format might be more useful 
for learning dynamic aspects of a content, while a static visualiza- 
tion format might be more useful for learning static aspects of the 
very same content. Likewise, sequential visualizations might be 
more suitable for learning sequential aspects of that content. 

We conducted a series of three experiments to investigate this 
prediction. Looking ahead, because the results of these experi- 
ments did not support the task-appropriateness hypothesis, we 
conducted an additional experiment aiming at explaining the re- 
sults and at investigating the possibility that learning performance 
depends on the match between the presentation format provided 
during learning and the visualization format at test, which turned 
out to be the case. 


Learning With Static and Dynamic Visualizations 


In a common study investigating the potential benefits of learn- 
ing with static and dynamic visualizations, different groups of 
participants are presented with (a) a single picture, (b) multiple
pictures simultaneously, (c) a sequence of temporally separated 
pictures, or (d) an animation. Great care is taken to equalize 
content between the visualization formats. Furthermore, the visu- 
alizations are often accompanied by spoken or written text. The 
topics in such studies can be as diverse as mechanical systems 
(Boucheix & Schneider, 2009), Lego (Castro-Alonso, Ayres, & 
Paas, 2015), locomotion patterns (Imhof, Scheiter, Edelmann &
Gerjets, 2011), time zones (Schnotz & Rasch, 2005), motor skills 
(Wong et al., 2009), or Chinese characters (Soemer & Schwan, 
2012). 

In the early days of animation research, the results of such
studies were somewhat conflicting with regard to the question
of whether dynamic visualizations can promote learning better 
than static visualizations, at least for contents that involve temporal 
changes (Tversky, Morrison, & Bétrancourt, 2002). Early studies 
showing a superiority of dynamic visualizations for science learn- 
ing (e.g., Park & Gittelman, 1992; Rieber, 1991) were subse- 
quently criticized for methodological issues (Tversky, Morrison, & 
Bétrancourt, 2002), and followed by demonstrations showing that 
learning with static visualizations can be superior to learning with 
dynamic visualizations (e.g., Hegarty, Kriz, & Cate, 2003; Mayer 
et al., 2005; Scheiter, Gerjets, & Catrambone, 2006). In more 
recent years, researchers have sought to identify the specific con- 
ditions under which one visualization format might be superior to 
others, with both recent empirical reports (e.g., Castro-Alonso, 
Ayres, & Paas, 2015; Imhof et al., 2011; Wong et al., 2009) and a 
meta-analysis (Höffler & Leutner, 2007) suggesting the possibil-
ity that dynamic visualizations might be generally helpful for
procedural-manipulative tasks (Sweller, Ayres, & Kalyuga, 2011). 

In order to theoretically explain the diversity of results in the 
literature, one assumption has been that different visualization 
formats convey the content to be learned in a more or less optimal 
form with regard to the resulting mental representation of that 
content. For example, Tversky, Morrison, and Bétrancourt (2002) 
have suggested that contents involving “change over time, for 
example movement or transformation or process” (p. 258) would 
benefit from dynamic visualizations, while other types of contents 
would be more suitable for learning with static visualizations. 
However, there are numerous examples in the literature demon- 
strating learning being superior using static visualizations or static- 
sequential visualizations compared to learning using dynamic vi- 
sualizations for material that contains temporal aspects (e.g., 
Lowe, Schnotz, & Rasch, 2011; Mayer et al., 2005). Conversely, 
Soemer and Schwan (2012) recently found no statistically signif- 
icant differences between two groups of students learning shapes 
and meanings of Chinese characters with static and dynamic 
visualizations when viewing times were equalized. Such examples 
suggest that factors other than content play important roles in 
successful learning. 

One possibility that has recently been raised by Lowe, Schnotz, 
and Rasch (2011) is that the usefulness of a visualization format 
depends on the requirements of a specific learning task. According 
to this view, the form of the mental representation resulting from 
learning with a specific visualization format needs to be aligned 
with the requirements of a learning task for successful learning. 
Lowe et al. have supported this task-appropriateness hypothesis 
by comparing participants’ performance on putting the key stages 
of a kangaroo hop in order after viewing a set of eight images 


depicting these stages in either a dynamic, a static-sequential, or a 
static-simultaneous visualization format. The results showed supe-
rior performance of the static-sequential group, in line with the
hypothesis that the learning task (serial-order learning of hop 
states) and the visualization format (sequential presentation) have 
to match for successful learning. 

The task-appropriateness hypothesis makes the interesting pre- 
diction that the factor learning task and the factor visualization 
format interact with each other. For example, a sequential visual- 
ization format might be more useful for one task, while a static 
visualization format might be more useful for a second task, even 
if both learning tasks concern the same learning material.
However, at present, the very idea that the same visualization 
format could either promote or hinder learning different aspects of 
the same material, depending on the task requirements, has not yet 
been tested directly. One reason for this is the problem that 
learning tasks are often difficult to manipulate independent of 
learning contents, meaning that one has to compare across differ- 
ent studies that vary widely in learning content and design char- 
acteristics. We think that an investigation of the relation between 
task-requirements and visualization formats necessitates a crossing 
of the factors task requirements and visualization format within the 
same learning material. The task-appropriateness hypothesis can 
then be supported if one can show that dynamic aspects of the 
learning material are most effectively learned with animations, 
while static and sequential requirements of the very same material 
are most effectively learned with static pictures and picture se- 
quences, respectively. 


The Current Study 


For a test of these predictions, it is necessary to have a type of 
learning material that includes a static requirement, a sequential 
requirement, as well as a dynamic requirement. Here, static re- 
quirement is understood as learning a fixed configuration of ele- 
ments related to each other; importantly, neither the elements nor 
their relationship change over time. Sequential requirement is 
understood as learning a configuration of elements related to each 
other that includes at least two different discrete states. Further-
more, the states can be put in order along a specific criterion (e.g.,
time) and are of particular importance for learning. Dynamic 
requirement is understood as learning a continuously changing 
configuration of elements. In other words, it is the motion aspect 
of this information (i.e., trajectory) which is of particular relevance 
for learning, not specific states along the motion trajectory. 

Fortunately, there is one type of real-world learning material 
that supplies researchers with exactly these conditions: Chinese 
characters. Chinese characters are composed of a set of one to 
several standardized strokes fitted into a certain spatial configura- 
tion. A learner of the Chinese script has to accomplish at least three 
basic learning tasks related to the characters’ shapes alone (see 
Figure 1). First, the spatial configuration of the strokes needs to 
be learned in order to both read and write a character. A subtle 
change in spatial configuration may result in a completely different 
character with a different meaning. Second, the strokes of each 
character have to be written in a fixed sequence and violations of 
this sequence may result in a hard-to-read handwriting. Finally, not 
only the sequence of strokes but also the direction in which each 
stroke has to be drawn is fixed (e.g., similar looking horizontal 
strokes may be drawn rightward in some situations but leftward in
other situations). In sum, learning the shape of Chinese characters
comprises at least three tasks: One task is related to the static
configuration of strokes, a second task is related to the sequential
order of strokes, and a third task is related to the dynamic unfold-
ing of a stroke.

Figure 1. Three requirements for learning Chinese characters. Left: Configuration of a character; Middle:
Stroke order; Right: Stroke directions.

Applying the task-appropriateness hypothesis to the present
case, learning the stroke configuration of a character should be best
supported by static visualizations, while learning the stroke se-
quence should be best supported by sequential visualizations.
Lastly, learning the direction of strokes should be best supported
by viewing a dynamic visualization. We tested this prediction in a
series of experiments.


Experiments 1-3


In the first three experiments, we instructed participants to learn
either the stroke configuration of a set of stroke patterns resem-
bling Chinese characters (Experiment 1), the sequence of strokes
of the same set of stroke patterns (Experiment 2), or the stroke
directions (Experiment 3). Using a between-subjects design, one group in
each experiment viewed a static picture of each character, another
group viewed a static-sequential presentation of the stroke se-
quence, and a third group viewed a dynamic presentation of the
unfolding strokes.

The learning material for all experiments was designed to be
consistent in visual complexity as measured by several shape
properties (e.g., stroke count, connectedness of the strokes). In
addition, the stroke-sequence task in Experiment 2 required only a
small number of strokes for each character (i.e., four) in order to
ensure that the material was within the limits of the assumed item
capacity of working memory (Luck & Vogel, 2013). Because there
were not enough existing characters fulfilling all our constraints,
we composed nonexistent stroke patterns by changing certain
properties of existing Chinese characters. The design change in-
cluded shortening and lengthening as well as rotating and warping
strokes. In order to make this procedure as objective and random
as possible, our custom-made experimental software automatically
created a set of characters separately for each participant by
applying the above described transformations with random param-
eters on 10 out of 20 base configurations during the runtime of the
experiment. Thus, each participant received a slightly different set
of items, a procedure that increased the item pool and, thus, the
generality of the experimental results. Likewise, the order of
strokes and stroke directions for each stroke pattern were deter- 
mined randomly at runtime for each participant. 
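
The per-participant randomization described above can be sketched as follows. This is only an illustration under stated assumptions: the data layout, the function name `make_item_set`, and all parameter ranges are hypothetical, since the original software and its settings are not described in more detail. Only the counts (10 of 20 base configurations, four strokes) come from the text.

```python
import random

N_BASE = 20    # base configurations (from the text)
N_ITEMS = 10   # stroke patterns per participant (from the text)
N_STROKES = 4  # strokes per pattern (from the text)


def make_item_set(rng: random.Random) -> list[dict]:
    """Draw 10 of the 20 base configurations and randomize each one."""
    items = []
    for base_id in rng.sample(range(N_BASE), N_ITEMS):
        strokes = []
        for _ in range(N_STROKES):
            strokes.append({
                # Hypothetical parameter ranges, for illustration only.
                "length_scale": rng.uniform(0.8, 1.2),   # shorten or lengthen
                "rotation_deg": rng.uniform(-15.0, 15.0),
                "warp": rng.uniform(0.0, 0.2),
                "reversed": rng.random() < 0.5,          # stroke direction
            })
        order = list(range(N_STROKES))
        rng.shuffle(order)                               # stroke order
        items.append({"base": base_id, "strokes": strokes, "order": order})
    return items


# One generator per participant gives each participant a different item set.
items = make_item_set(random.Random(1))
print(len(items), items[0]["order"])
```

Seeding one generator per participant keeps each item set reproducible while still differing between participants.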

For each experiment, a different participant sample was re- 
cruited from the institute’s participant pool, each participant re- 
ceiving four euros after completion of the experiment. In every 
experiment, participants were randomly assigned to one visualiza- 
tion condition. The procedure was the same for all three experi- 
ments. One to six participants were tested at the same time (group 
experiment). After giving their written consent and assurance that 
they had read and understood the conditions of participation (in- 
cluding their right to abort the experiment), participants were 
briefly introduced to the general task (depending on the experi- 
ment, learning either the stroke configurations, the sequence of 
strokes, or the stroke directions). In addition, each participant 
received specific task instructions on the computer screen depend- 
ing on the specific visualization format. The participants then saw 
an example of a stroke pattern to be learned in their specific 
visualization format for 20 s. Before starting the main session, 
participants were allowed to ask questions in case of uncertainty 
about the task requirements. After ensuring that all participants had 
understood the task, they were permitted to begin the experiment. 
By pressing a key, the 10 customized stroke patterns were pre- 
sented one after another for 20 s each. Parts of the procedure that 
differed between the experiments (in particular, the visualization 
formats) are described in more detail below. 

As mentioned earlier, by task requirements we are referring to 
specific static, sequential, or dynamic aspects of the material to be 
learned. Connected to this issue is the question of the response 
mode as the type of behavioral act that is used to measure learning. 
In our case, we decided on old/new recognition as a measure of 
learning, which has some advantages in the present case. Most 
importantly, in order to compare the three task requirements, it is 
necessary to use a response mode that can be used across all 
conditions and experiments. Another particular advantage of old/ 
new recognition is that it avoids the extraordinary difficulty that 
participants have in reconstructing abstract stroke patterns and the 
subjective element of judging the correctness of strokes (see the 
discussion in the limitations section of Soemer & Schwan, 2012). 
The results of the first three experiments can be seen in Table 1. 


Experiment 1 


In Experiment 1, participants were divided into three groups 
according to the factor visualization format (static, sequential, 
dynamic) and were instructed to learn the stroke configuration of
a set of 10 randomly generated stroke patterns. All stroke patterns
were created separately for each participant during runtime and
fulfilled the constraints described above. According to the
task-appropriateness hypothesis, one should expect a superiority of
the static visualization format over the other two visualization
formats.

Table 1
Descriptive Statistics for the Data of Experiments 1-3

            Experiment 1                   Experiment 2                   Experiment 3
      Static  Sequential  Dynamic    Static  Sequential  Dynamic    Static  Sequential  Dynamic
M      .76       .70        .71       .74       .66        .70       .66       .65        .64
SD     .09       .09        .08       .13       .10        .14       .13       .11        .09
N      28        28         28        32        33         30        30        30         30

Participants. Fifty-three female and 31 male students be- 
tween 19 and 60 years of age (M = 24.5, SD = 6.3) were recruited 
from the institute’s participant pool, 28 participants being assigned 
to each one of the visualization format conditions (static: M = 23.8 
years, SD = 2.9, 11 male participants; sequential: M = 25.4 years, 
SD = 8.3, 12 male participants; dynamic: M = 24.5 years, SD = 
6.6, 10 male participants). Participants with self-reported knowl- 
edge of Chinese characters (e.g., Chinese and Japanese learners) 
were excluded from the experiment. 

Apparatus and procedure. The experiment was conducted 
on laptops with 12.1-in. screens. Display resolution was set to 
1024 × 768 pixels and the refresh rate to 60 Hz. Stroke patterns
were presented in black on a light gray background and were fit
into a 200 × 200 pixel frame. In the static condition, all
strokes of a pattern appeared at once (Figure 2, left) and each 
pattern was visible on the screen for 20 s. In the sequential 
condition, all strokes of each pattern first appeared at once in gray 
and during the 20 s presentation time, black strokes were super- 
imposed one after the other every 2 s in the determined stroke 
order (Figure 2, middle). This resulted in two full cycles of order 
presentation. The same procedure was adopted in the dynamic 
condition. In addition, however, each superimposed black stroke 
was presented as dynamically unfolding from its determined start-
ing point to its end point during the 2 s (Figure 2, right) in the
middle of the screen. Each character was framed by a thin gray
rectangle. The blank interval between two characters was 500 ms. 
The whole set of characters was repeated three times. 
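
The timing of the sequential condition can be checked with a short sketch. It assumes four strokes per pattern, a first stroke at 0 s, and no gap between cycles; none of these details is stated explicitly for Experiment 1, and the function name is ours.

```python
PRESENTATION_S = 20    # presentation time per pattern (from the text)
STROKE_INTERVAL_S = 2  # one stroke every 2 s (from the text)
N_STROKES = 4          # assumed number of strokes per pattern


def stroke_onsets(total_s: int, interval_s: int, n_strokes: int):
    """Return (onset in seconds, stroke number) for every full cycle."""
    cycle_s = n_strokes * interval_s
    full_cycles = total_s // cycle_s
    return [(cycle * cycle_s + i * interval_s, i + 1)
            for cycle in range(full_cycles)
            for i in range(n_strokes)]


onsets = stroke_onsets(PRESENTATION_S, STROKE_INTERVAL_S, N_STROKES)
print(onsets)
```

Under these assumptions one cycle takes 8 s, so two full cycles fit into the 20 s presentation and 4 s remain. The text does not say how that remaining time was used.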

At test (old/new recognition task), the learned characters were 
presented again one after another in static form. In one half of the 
test trials, there was a slight change of a character’s shape which
could include shortening or lengthening of one of the strokes as
well as a change in the orientation of one of the strokes. Prior to the
testing phase, participants received detailed instructions for the
old/new recognition task on the screen. An example character
(not included in the learning set but seen in Figures 1-4 of this
article) was introduced first, and possible changes in configura-
tion for this character were subsequently explained. Participants
were then given five sample trials in which the same character was 
shown in its original form (requiring an old decision) and five 
sample trials in changed form (requiring a new decision). On each 
trial, participants had to indicate whether the original character had 
changed, receiving accuracy feedback to check whether they had 
understood the instructions. After the training trials, participants 
had the opportunity to repeat the example and to ask questions 
before starting the main testing session. The testing session ended 
with handing out the compensation to the participants and giving 
a final briefing about the goals of the experiment. 

Results and discussion. An analysis of variance (ANOVA) 
was performed on arcsine-transformed proportions of correct de-
cisions (Table 1, left) with visualization format as fixed factor. The
results showed a significant main effect for visualization format,
F(2, 81) = 3.47; p < .05; ηp² = .08. Planned linear contrasts
revealed a significant difference between the static and sequential 
group (p < .05; Cohen’s d = .67), with the static group outper- 
forming the sequential group (M = .76 vs. M = .70 recognition 





Figure 2. Visualization formats in Experiment 1. Left: Static condition (all strokes are drawn at the same time); 
Middle: Sequential condition (complete third stroke is currently drawn); Right: Dynamic condition (third stroke 


is drawn dynamically). 
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accuracy). The difference between the static and the dynamic 
group (p = .22), and the difference between the sequential and 
dynamic groups (p =.64) did not reach the significance criterion. 
The superiority of static pictures over sequential pictures was to be 
expected under the task-appropriateness hypothesis. 
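
The effect size reported for the main effect can be checked against the F ratio and its degrees of freedom. The following is a minimal sketch, assuming the standard identity for partial eta squared, ηp² = (F × df_effect) / (F × df_effect + df_error); the function name `partial_eta_squared` is illustrative and not part of the original analysis.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    explained = f_value * df_effect
    return explained / (explained + df_error)


# Main effect of visualization format in Experiment 1: F(2, 81) = 3.47
eta_p2 = partial_eta_squared(3.47, df_effect=2, df_error=81)
print(f"partial eta squared = {eta_p2:.2f}")
```

Rounded to two decimals, the result agrees with the effect size of .08 given in the text.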


Experiment 2 


In Experiment 2, participants were required to learn the drawing 
order of strokes. The change from learning stroke configurations to 
learning the order of the strokes made it necessary to adapt the
learning material slightly, as the static condition in Experiment 1
did not contain any order cues. Therefore, a label indicating the 
position of a stroke within the character’s stroke order was drawn 
halfway inside each stroke (Figure 3). According to the task- 
appropriateness hypothesis, the sequential visualizations should be 
superior to the static and dynamic visualizations. 

Participants. Seventy-four female and 23 male students be- 
tween 18 and 43 years of age (M = 23.3, SD = 4.2) were recruited 
from the institute’s participant pool. Thirty-three participants each 
were assigned to the static and sequential formats, while 31 partici- 
pants were assigned to the dynamic format (static: M = 23.7 years, 
SD = 3.9, nine male participants; sequential: M = 23.5 years, SD = 
5.0, seven male participants; dynamic: M = 22.7 years, SD = 4.2, 
seven male participants). Participants with self-reported knowledge of 
Chinese characters (e.g., Chinese and Japanese learners) were ex- 
cluded from the experiment. 

Apparatus and procedure. The procedure was essentially the 
same as in Experiment 1 with the exception that we added labels to 
the characters in all conditions, indicating the order in which the 
strokes had to be drawn (see Figure 3). At test, the learned characters 
(with added labels) were presented again, one after another in static 
form. In one half of the test trials, stroke order between two consec- 
utive strokes was switched (e.g., Stroke 2 and 3 became Stroke 3 and 
2). Participants were requested to indicate whether or not there had 
been a change in stroke order. 

Results and discussion. Due to a recording error, one partic- 
ipant from the dynamic condition and one participant from the 
static condition had to be removed. An ANOVA was performed on 
arcsine-transformed proportions of correct decisions (Table 1, 
middle) with visualization format as fixed factor. The results 
showed a significant main effect for visualization format, F(2,
92) = 3.25; p < .05; ηp² = .07. Planned linear contrasts revealed a


significant difference between the static and sequential group (p < 
.05; Cohen’s d = .66), with the static group outperforming the 
sequential group (M = .74 vs. M = .66). There was no significant 
difference between the static and the dynamic group (p = .58) or 
between the sequential and the dynamic group (p = .30). The 
superiority of static pictures over sequential pictures was unex- 
pected under the task-appropriateness hypothesis, as this hypoth- 
esis would have clearly favored a sequential presentation format 
for the case of learning stroke order. 
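
For readers who wish to see how an effect size such as Cohen's d relates to group means, the following sketch uses the pooled-standard-deviation formula. The group means and group sizes are taken from the text (one participant removed from the static group), but the standard deviations of .12 are hypothetical placeholders chosen for illustration only, since Table 1 is not reproduced here. The original d may also have been computed on the arcsine-transformed values, so this is not a reconstruction of the authors' calculation.

```python
import math


def cohens_d(mean_a: float, mean_b: float,
             sd_a: float, sd_b: float,
             n_a: int, n_b: int) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_variance = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_variance)


# Means and group sizes from the text; both SDs (.12) are HYPOTHETICAL.
d = cohens_d(mean_a=0.74, mean_b=0.66, sd_a=0.12, sd_b=0.12, n_a=32, n_b=33)
print(f"Cohen's d = {d:.2f}")
```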


Experiment 3 


In Experiment 3, participants were required to learn the drawing 
direction of strokes. This task requirement made it necessary to 
add directional information to the static and sequential conditions. 
Thus, arrows indicating the directions were added inside each 
stroke (Figure 4). According to the task-appropriateness hypothe- 
sis, the dynamic visualizations should be superior to the static and 
sequential visualizations. 

Participants. Sixty-five female and 25 male students between 
18 and 36 years of age (M = 24.0, SD = 4.1) were recruited from
the institute’s participant pool. Thirty participants were assigned to 
each one of the visualization conditions (static: M = 23.4 years, 
SD = 3.2, eight male participants; sequential: M = 24.5 years, 
SD = 4.2, eight male participants; dynamic: M = 24.1 years, SD = 
4.9, nine male participants). Participants with self-reported knowl- 
edge of Chinese characters (e.g., Chinese and Japanese learners) 
were excluded from the experiment. 

Apparatus and procedure. The procedure was essentially 
the same as in Experiment 1 with the exception that we added 
arrows inside the strokes in all conditions. These arrows indicated 
the direction in which the strokes had to be drawn (see Figure 4). 
At test, the learned characters (with added arrows) were presented 
again one after another in static form. In one half of the test trials, 
arrow directions were switched (e.g., left-to-right changed to right- 
to-left). Participants were asked to indicate whether or not there 
had been a change in stroke direction. 

Results and discussion. An ANOVA was performed on 
arcsine-transformed proportions of correct decisions (Table 1,
right) with visualization format as fixed factor. The main effect of
visualization format was nonsignificant, F(2, 87) = 0.36; p = .70.
This nonsignificant result prevents us from drawing conclusions
regarding the task-appropriateness hypothesis. 
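
The analysis pipeline used in Experiments 1-3 (arcsine transformation of proportions followed by a one-way ANOVA with visualization format as the factor) can be sketched as follows. The text does not state which variant of the arcsine transformation was applied; the sketch assumes the common arcsine-square-root form. All proportions below are invented for illustration and are not the study's data, and the function names are hypothetical.

```python
import math


def arcsine_transform(p: float) -> float:
    """Arcsine-square-root transform of a proportion (assumed variant)."""
    return math.asin(math.sqrt(p))


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way between-subjects ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# HYPOTHETICAL proportions correct for three visualization formats.
proportions = {
    "static": [0.80, 0.75, 0.70, 0.78],
    "sequential": [0.68, 0.72, 0.65, 0.70],
    "dynamic": [0.74, 0.71, 0.69, 0.76],
}
transformed = [[arcsine_transform(p) for p in scores]
               for scores in proportions.values()]
f_value, df1, df2 = one_way_anova_f(transformed)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```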





Figure 3. Visualization formats in Experiment 2. Left: Static condition (all strokes are drawn at the same time); 
Middle: Sequential condition (complete third stroke is currently drawn); Right: Dynamic condition (third stroke 
is drawn dynamically). See the online article for the color version of this figure. 
Figure 4. Visualization formats in Experiment 3. Left: Static condition (all strokes are drawn at the same time);
Middle: Sequential condition (complete third stroke is currently drawn); Right: Dynamic condition (third stroke
is drawn dynamically).


Discussion of Experiments 1-3 


The first three experiments aimed at investigating the task- 
appropriateness hypothesis, which states that for successful learn- 
ing the form of the mental representation resulting from a specific 
visualization format needs to be optimally aligned with the re- 
quirements of a specific learning task. We tested this hypothesis by 
supplying learners with static pictures, picture sequences, or ani- 
mations for learning static, sequential, or dynamic aspects of the 
same learning material. In Experiment 1, we obtained statistical 
evidence that stroke configurations were learned best with static 
visualizations. Experiment 2 showed that stroke sequences
were also learned best with static visualizations. Lastly, in Experiment
3 (stroke directions), there was no reliable evidence for a differ- 
ence in learning. 

Clearly, on the basis of these experiments, it is difficult to believe in
the validity of the task-appropriateness hypothesis. While results 
of Experiment 1 (learning stroke configurations) are in line with it, 
it is unclear why this hypothesis would not have predicted a 
superiority of sequential over static visualizations for learning 
stroke order (Experiment 2). With static visualizations, partici- 
pants had to construct the order of the strokes by deriving it from 
the numbers printed on top of the strokes. In contrast, sequential 
visualizations arguably provide this information in a readily avail- 
able format: They encoded stroke order on a temporal dimension, 
as in the reality of writing, and this coding should have theoreti- 
cally promoted learning the order of the strokes. Also in Experi- 
ment 3, a proponent of the task-appropriateness hypothesis would 
certainly have expected the dynamic condition to be superior to the 
other two conditions. 

Alternatively, one might argue that sequential visualizations are 
generally inferior to static visualizations in terms of cognitive 
processing limitations (Sweller, Ayres, & Kalyuga, 2011). This is, 
for example, suggested by the results of Experiment 1 in which 
performance of the sequential group was significantly worse com- 
pared to the static group although character configurations were 
visible as a gray outline in this condition for the entire duration of 
the presentation (see Figure 2). If this is indeed the case, a 
proponent of the task-appropriateness hypothesis might expect the 
differences in performance between the static and the sequential 
condition to be smaller in Experiment 2 than in Experiment 1. This 
is because the higher availability of order cues in the sequential 
condition should have to some extent offset any higher cognitive
processing costs in this condition. However, an additional 


ANOVA conducted on data from both experiments does not sup- 
port this notion (nonsignificant main effect of experiment: F(1, 
176) = 0.99, p = .32; significant main effect of visualization 
format: F(2, 176) = 5.63, p < .01). 

Although cognitive processing costs might have nevertheless 
contributed to the lower performance in the sequential condition, 
one should also take the performance of the dynamic condition 
into account. From the viewpoint of cognitive processing limita- 
tions, both the sequential and the dynamic conditions were argu- 
ably more difficult to perceive and process than the static condition 
due to the transient nature of pictures and picture sequences 
(Sweller, Ayres, & Kalyuga, 2011; Tversky, Morrison, & Bétran- 
court, 2002). One might argue that dynamic visualizations are 
more transient than sequential visualizations, and in the case of 
learning stroke order and learning stroke configuration, they con- 
tain additional but unnecessary cues about stroke direction. One 
might therefore expect that performance should be even lower in 
the dynamic condition than in the sequential condition. On the 
contrary though, the contrasts between the dynamic condition and 
the other conditions did not reach significance. In fact, the dy- 
namic condition was numerically in-between both conditions. This 
is an unexpected result from the standpoint that cognitive process- 
ing limitations were the cause for the lower performance of the 
sequential group. 

Although one might be tempted to conclude that the results of 
Experiments 1 and 2 generally favor the use of static visualizations 
for learning both static and nonstatic contents in line with Mayer 
et al. (2005), there might be alternative or unidentified cognitive 
factors yet overlooked that determine the usefulness of a visual- 
ization format in learning. One factor that has so far rarely been 
considered is the activity that participants engage in at test. Recall 
that in all three experiments, old/new decisions had to be made 
upon presentation of a static picture which contained the informa- 
tion to be tested (in particular, numbers indicating stroke order in 
Experiment 2 and arrows indicating stroke directions in Experi- 
ment 3). In other words, only the static pictures group received the 
identical visualization format during learning and at test, while the 
other groups were tested with a visualization format different from 
that viewed during the learning session. If one accepts the idea that 
the form of the mental representation of the learned content differs 
depending on the visualization format during learning, one might 
argue that the latter groups were forced to apply additional and 
potentially error-prone cognitive operations possibly related to 




comparing incoming information at test with the information 
stored in memory. 

In fact, such an explanation comes close to what recognition 
memory researchers call study-test congruence effects. Using an 
experimental procedure similar to ours but without manipulations 
of task requirements, fundamental research on study-test congru- 
ence effects has shown that recognition memory for static and 
moving images improves if study items and test items match in 
presentation format (Buratto, Matthews, & Lamberts, 2009; 
Lander & Davies, 2007). Adapting the study-test congruence hy- 
pothesis to the present case, one might argue that the match 
between visualization formats during learning and at test was 
responsible for the superiority of the static pictures groups in 
Experiments 1 and 2. This hypothesis also makes the interesting
prediction that the sequential pictures group should outperform the 
static picture group if testing is carried out with sequential pic- 
tures—independent of task requirements. We conducted a control 
experiment to investigate the applicability of the study-test con- 
gruence hypothesis for the present case. 


Experiment 4 


The main goal of Experiment 4 was to investigate the study-test 
congruence hypothesis which states that the match between visu- 
alization formats during learning and at test improves performance 
independent of the task requirements. To this end, we selected the 
two visualization formats and tasks for which we had observed 
statistically reliable differences in previous experiments. Thus, in 
Experiment 4, we crossed the factors learning task (stroke config- 
uration or stroke order) and visualization format (static pictures or 
sequential pictures) such that there were four groups of partici- 
pants: Two groups were presented with static pictures and another 
two groups were presented with sequential pictures. For each 
visualization format, one group was required to learn the stroke 
configurations, while the second group was required to learn stroke 
sequences. All four groups were presented with sequential pictures 
at test. The study-test congruence hypothesis predicts that partic- 
ipants in the sequential-pictures group should outperform partici- 
pants in the static pictures group. 


Participants 


Ninety-one female and 29 male students between 18 and 34 
years of age (M = 24.2, SD = 3.25) were recruited from the 
institute’s participant pool, 30 participants being assigned to each 
one of the conditions (learning static/test static: M = 24.1 years, 
SD = 3.23, five male participants; learning sequential/test static: 
M = 24.1 years, SD = 3.70, six male participants; learning 
static/test sequential: M = 24.2 years, SD = 3.26, six male 
participants; learning sequential/test sequential: M = 24.7 years, 
SD = 3.36, 12 male participants). Participants with self-reported 
knowledge of Chinese characters (e.g., Chinese and Japanese 
learners) were excluded from the experiment. 


Apparatus and Procedure 


The material from Experiment 2 was reused for Experiment 4. 
There were no changes to any of the presentation parameters in the 
static and sequential visualization conditions during learning. In 


contrast, the testing was carried out with an adapted version of the 
old/new recognition task in which sequential pictures were pre- 
sented instead of static pictures. The presentation parameters at test 
were identical to the presentation parameters of the sequential 
condition during learning (as in Experiments 1 and 2, the presen- 
tation parameters of the pictures at test had been identical to the 
presentation parameters of the static condition during learning). 
The general experimental procedure was identical to the procedure 
in Experiments 1-3. 


Results and Discussion 


An ANOVA was performed on arcsine-transformed proportions 
of correct decisions (Table 2) with learning task (stroke configu- 
ration or stroke order) and visualization formats (static pictures or 
sequential pictures) as fixed factors. The main effect of visualization
format was significant, F(1, 116) = 4.02; p < .05; ηp² = .03,
while the main effect of learning task, F(1, 116) = 1.95; p = .16, 
and the interaction were not, F(1, 116) = 0.00; p = .99. As can be 
seen from Table 2, the sequential pictures group outperformed the 
static picture group in both learning tasks (M = .66 vs. M = .62, 
and M = .68 vs. M = .64, respectively). 

In order to compare performance between visualization formats 
at test, two further analyses were carried out including the data 
from Experiments 1 and 2. In the first analysis, the data from the
static and sequential groups of Experiment 1 was combined with 
the data from the static and sequential groups learning stroke 
configurations in Experiment 4. An ANOVA was performed on 
arcsine-transformed proportions of correct decisions with visual- 
ization format during learning (static or sequential) and visualiza- 
tion format at test (Static/Experiment 1 or Sequential/Experiment 
4) as fixed factors. This analysis revealed a significant main effect 
of visualization format at test, F(1, 112) = 38.29; p < .01; ηp² =
.25, indicating that testing with static pictures in Experiment 1 led
to better performance than testing with sequential pictures in 
Experiment 4 (means across learning conditions in Experiments 1 
and 4: M = .73 vs. M = .64, respectively). Furthermore, there was 
an interaction between the two fixed factors, F(1, 112) = 11.14; 
p < .01; ηp² = .09, which is consistent with the hypothesis stating
that—for the case of learning stroke configurations—static pic- 
tures are better when testing is carried out with static pictures, and 
that sequential pictures are better when testing with sequential 
pictures (means across Experiments 1 and 4 for the “match” vs. 
“nonmatch” conditions: M = .71 vs. M = .66, respectively). The 
main effect of visualization format during learning was nonsignificant,
F(1, 112) = 0.46; p = .50.
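
The marginal and "match" versus "nonmatch" means for the stroke-configuration task can be approximately recovered from the rounded cell means given in the text. The sketch below is illustrative only: it averages the rounded cell means without weighting by group size, so small deviations from the values computed on the raw data are possible. The variable names are hypothetical.

```python
# Rounded cell means for the stroke-configuration task, as given in the text.
# Keys are (visualization format during learning, visualization format at test).
cell_means = {
    ("static", "static"): 0.76,          # Experiment 1
    ("sequential", "static"): 0.70,      # Experiment 1
    ("static", "sequential"): 0.62,      # Experiment 4
    ("sequential", "sequential"): 0.66,  # Experiment 4
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Study-test congruence: learning format equals test format ("match") or not.
match = mean(m for (learn, test), m in cell_means.items() if learn == test)
nonmatch = mean(m for (learn, test), m in cell_means.items() if learn != test)

# Marginal means for the visualization format at test.
test_static = mean(m for (_, test), m in cell_means.items() if test == "static")
test_sequential = mean(m for (_, test), m in cell_means.items() if test == "sequential")

print(f"match = {match:.2f}, nonmatch = {nonmatch:.2f}")
print(f"static test = {test_static:.2f}, sequential test = {test_sequential:.2f}")
```

A match advantage that holds under both test formats is the pattern that the significant interaction between learning format and test format reflects.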

An analogous analysis was carried out for the stroke order 
learning task by combining the data from Experiment 2 and the 
corresponding two groups learning stroke order in Experiment 4. 


Table 2
Descriptive Statistics for the Data of Experiment 4

          Stroke configuration        Stroke order
          Static     Sequential       Static        Sequential
M         .62        .66              .64           .68
SD        .07        .08              [illegible]   [illegible]
N         30         30               30            30

This analysis revealed a significant main effect of visualization
format at test, F(1, 121) = 13.98; p < .01; ηp² = .10, and an
interaction between the two fixed factors, F(1, 121) = 11.82; p <
.01; ηp² = .09. The main effect of visualization format during
learning was nonsignificant, F(1, 121) = 2.05; p = .19. While
performance seemed to be generally better when static pictures 
were used at test (means across learning conditions in Experiments 
2 and 4: M = .70 vs. M = .66, respectively), using sequential 
pictures during learning increased performance when sequential 
pictures were also used at test. Again, the additional analysis is
consistent with the hypothesis that static pictures are better when 
testing is carried out with static pictures and that sequential pic- 
tures are better when testing with sequential pictures (means across 
Experiments 2 and 4 for the “match” vs. “nonmatch” conditions: 
M = .70 vs. M = .65, respectively). 

Overall, the results of Experiment 4 not only further weaken the 
task-appropriateness hypothesis, they also suggest that, indeed, a 
match of visualization formats during learning and at test was the 
crucial determinant for success in learning stroke configurations 
and stroke order in Experiments 1 and 2. We will discuss the 
broader implications of these results in turn. 


General Discussion 


Four experiments were carried out to investigate the interplay of
visualization format during learning, task requirements, and
visualization format at test. The particular focus of Experiments 1-3 
was on the task-appropriateness hypothesis (Lowe, Schnotz, & Rasch, 
2011), which states that learning performance depends on the align- 
ment of the form of the mental representation resulting from learning 
with a specific visualization format and the requirements of a specific 
learning task. According to this view, the more a specific visualization 
format helps to construct a mental representation that is optimally 
aligned with the task, the better performance at test will be. Crucially, 
this relation holds independent of the visualization format at test. 

To test this hypothesis, we crossed the factors task requirements 
and learning content within the same learning material. Partici- 
pants were instructed to learn either the stroke configuration of a 
set of Chinese pseudocharacters (Experiment 1), the sequence of 
strokes of the same set of stroke patterns (Experiment 2), or the 
stroke directions (Experiment 3). In all three experiments, they 
learned the characters with one specific visualization format (static 
pictures, sequential pictures, or animations). It was found that
both stroke configurations and stroke order were learned best with 
static pictures (Experiments 1 and 2), while for stroke directions 
(Experiment 3), there was no reliable evidence for a difference 
between the visualization formats. 

On the basis of these results, we rejected the task-appropriateness
hypothesis as this hypothesis would have predicted the superiority 
of static pictures in learning stroke configurations, the 
superiority of sequential pictures in learning stroke orders, and the 
superiority of dynamic visualizations in learning stroke directions. 
Because an alternative hypothesis stating that cognitive pro- 
cessing limitations generally favor the static groups also did not 
satisfactorily explain the results, we carried out a fourth exper- 
iment testing an alternative study-test congruence hypothesis 
predicting that the match between visualization format used 
during learning and the one used at test increases learning 
performance, independent of the task requirements. On theoret- 


ical grounds, this might be the case if the degree of alignment 
between the form of the mental representation and the visual- 
ization format at test determines performance. Crucially, this 
relation holds independent of the specific task requirements. In 
the case of Experiments 1-3, a better match between visualiza- 
tion format used during studying and the one used at test 
enabled participants learning with static visualizations to read- 
ily identify the information presented at test with what had been 
stored in memory, while the other groups needed to apply 
additional and potentially error-prone processes that were pos- 
sibly related to comparisons between incoming information at 
test and information stored in memory. 

This fourth experiment, combined with the data of Experiments 1 
and 2, revealed that learning with sequential pictures was, indeed, 
superior when testing was carried out with sequential pictures, while 
learning with static pictures was superior when testing was carried out 
with static pictures—irrespective of the task requirements. These 
results suggest that an alignment between the mental representations 
resulting from learning with a specific visualization format and the 
information presented at test is a crucial factor determining test 
performance. Furthermore, because our results were obtained with a 
relatively simple yet ecologically valid task, we argue that they come 
close to representing a pure investigation of the effects of matching 
task requirements and visualization formats while also controlling for 
many potential nuisance factors that regularly contaminate studies 
with more complex learning material. Thus, our results point to 
relatively general phenomena that should be acknowledged in instruc- 
tional design and future research in this area. 

One might naturally ask to what extent a study-test congruence 
hypothesis can resolve inconsistencies in the literature on learning 
with dynamic and static visualizations. We would like to argue that, 
although one can find observations consistent with the study-test 
congruence hypothesis (e.g., Imhof et al., 2011), this question is 
difficult to answer without adopting additional post hoc assumptions 
because most of the previous studies have relied on testing procedures 
and learning contents that differ in many aspects from the material 
used during learning. For example, in the study by Lowe, Schnotz, 
and Rasch (2011), participants were required to learn the key stages of 
a kangaroo hop after viewing either an animation or a set of images 
depicting these stages in either a static-sequential or static- 
simultaneous format. At test, participants were required to put into 
correct order a set of static pictures depicting the key stages in the 
hopping cycle. Because the pictures to be put in order were presented 
simultaneously at test, one might straightforwardly predict better 
performance of the static-simultaneous group compared to the other 
groups. This, however, was not observed (instead, the static- 
sequential group outperformed the other groups). Still, one might also 
argue that simultaneous presentation at test did not make the task 
easier for participants in the static-simultaneous group because the 
pictures were scrambled at test—potentially creating a mismatch 
between the information presented (and assumingly stored) during 
learning and the information presented at test. One might continue to 
argue (in post hoc fashion) that the visualization format at test might 
have led the participants to scan the presented information picture- 
by-picture, which might ultimately have favored the static-sequential 
group. As one can see from this brief discussion, a better approach to 
study the effects of study-test (in-)congruence in learning with dy- 
namic and static visualizations might be to control the match between 
the material and procedures used during learning and at test. 
Our findings have important implications for instructional design.
First, it is suggested that researchers as well as practitioners will have 
to pay close attention to the testing conditions when investigating the 
usefulness of a certain visualization format. Each format might be 
effective as long as there is a minimum degree of alignment between 
the information learned and the information presented at test. How- 
ever, as previous studies vary widely in the amount and form of 
information presented at test, more systematic investigations over a 
variety of learning contents are needed for a complete understanding 
of how testing procedures interact with the information stored during 
learning. A related and interesting direction for future research would 
be to investigate the boundaries of the study-test congruence effect. 
That is, rather than having an exact match between visualization 
formats during learning and at test, one could investigate different 
degrees of alignment by carefully varying parameters of the visual- 
ization formats during learning and at test. It might turn out that some 
visualization formats allow for constructing more flexible mental 
representations than other formats, thus allowing for larger informa- 
tional discrepancies between learning and test. 

Lastly, one might rightfully ask whether study-test congruence 
effects found with a recognition task can generalize to other tasks such 
as reproduction.¹ In the present case, for example, using (manual)
reproduction of the characters instead of recognition might have 
changed the results with regard to the performance in the dynamic 
condition, as suggested by recent studies showing that procedural- 
manipulative tasks benefit from dynamic visualizations (e.g., Castro- 
Alonso, Ayres, & Paas, 2015; Imhof et al., 2011; Wong et al., 2009; 
also see Sweller, Ayres, & Kalyuga, 2011). Although acknowledging 
the potentially considerable impact of the factor response mode on 
learning outcomes, we think that the results of our experiments do not 
conflict with this possibility, and we see this as an exciting direction 
for future research. 


¹ We thank an anonymous reviewer for this suggestion.
Educational software in the form of games or so-called “computer assisted intervention” for young
children has become increasingly common and is receiving growing interest and support. Currently there are,
for instance, more than 1,000 iPad apps tagged for preschool. Thus, it has become increasingly important 
to empirically investigate whether these kinds of software actually provide educational benefits for such 
young children. The study presented in the present article investigated whether preschoolers have the 
cognitive capabilities necessary to benefit from a teachable-agent-based game of which pedagogical 
benefits have been shown for older children. The role of executive functions in children’s attention was 
explored by letting 36 preschoolers (3;9–6;3 years) play a teachable-agent-based educational game and
measuring their capabilities to maintain focus on pedagogically relevant screen events in the presence of
competing visual stimuli. Even though the participants did not succeed very well in an inhibition pretest, 
results showed that they nonetheless managed to inhibit distractions during game-play. It is suggested 
that the game context acts as a motivator that scaffolds more mature cognitive capabilities in young 
children than they exhibit during a noncontextual standardized test. The results further indicate gender 
differences in the development of these capabilities.


Keywords: inhibition, attention, teachable agents, eye tracking, learning by teaching 


Through the introduction of technology in preschools, new 
avenues for facilitating interventions in preschool have opened up 
(Clements, 2002; Huffstetter, King, Onwuegbuzie, Schneider, & 
Powell-Smith, 2010). One important potential is the facilitation of 
school readiness for children who otherwise would be at risk of 
falling behind once they start school because of weak preparatory 
skills, particularly in early numeracy and literacy (Clements, 
Sarama, Spitler, Lange, & Wolfe, 2011; Kendeou, van den Broek, 
White, & Lynch, 2007; Morgan, Farkas, & Wu, 2009). In the 
present study, we investigate the possibilities of introducing com- 
puter games in a revamped approach of the learning by teaching 
(LBT) paradigm with the use of so-called teachable agents in
preschool. The LBT paradigm reverses the role of the student and 
lets students become teachers. However, the question is whether 
this kind of educational software, which has been proven pedagogically
valuable for schoolchildren, is suitable for children of preschool
age. To be able to teach, focus and attention on your tutee
is crucial and this requires a sufficient development of executive 
control. Furthermore, the preschool is at times a distracting envi- 
ronment with high levels of noise and other perturbations. Thus, 
before investing resources in developing a full-fledged LBT-game 
for preschoolers and launching a longitudinal study to investigate 
learning effects, there are some crucial and more basic questions 
that need to be answered. In this study, we used a scaled-down
version of an LBT-game to investigate preschoolers’ ability
to inhibit visual distractions. 


Need for Empirically Informed Educational 
Software Development 


The impact of computer usage throughout today’s society has 
also affected preschool curricula in which teaching of basic tech- 
nological interaction and use of computers in education is nowa- 
days encouraged (The Swedish National Agency for Education, 
2011; UNESCO, 2008). Research on technology’s impact on chil- 
dren’s health over the past 30 years has produced divergent results. 
It is suggested that children in the midst of their cognitive devel- 
opment should have minimum technological exposure (Council on 
Communications & Media, 2010). In a review of neuroscientific 
and psychological studies related to children’s exposure to digital 
media, Howard-Jones (2011) emphasizes that we must acknowledge
the factors that lead to detrimental effects on the developing
brain. He identifies these factors as (a) violent media content,
(b) excessive use, and (c) late night use. Studies have shown that 
these factors can, for some individuals, result in attention disor- 
ders, disturbed sleep patterns, visual strain, and even seizures 
(Landhuis, Poulton, Welch, & Hancox, 2007; Page, Cooper, 
Griew, & Jago, 2010). 

However, results pertaining to research on moderate use of 
computers and its impact on young children’s learning and edu- 
cational development present a more pleasant side. Children with 
access to computers at home during preschool age have been found 
to perform better on school readiness as well as motor and cogni- 
tive development tasks even when socioeconomic status is con- 
trolled for (Fish et al., 2008; Li & Atkins, 2004). Computer use in 
early age has also shown positive effects on language acquisition 
(Chera & Wood, 2003; Din & Calao, 2001), social, collaborative 
problem-solving (Cardelle-Elawar & Wetzel, 1995; Muller &
Perlmutter, 1985), and learning motivation (Bergin, Ford, & Hess,
1993; Liu, 1996; for a review on the effects of media use on young 
children’s learning and reasoning, see Lieberman, Bates, & So, 
2009). 

These mixed results leave both preschool teachers and parents 
struggling with how to approach the issue of letting young children 
interact with technology. Ljung-Djärf (2008), in a study of attitudes
toward computers in three preschools in Sweden, found that
there were three overall attitudes toward computer activities: (a) 
threatening other activities, (b) one of many alternative activities, 
and (c) an essential activity. Preschool personnel tried their best to 
implement computer use in line with the preschool curriculum.
However, the choice of computer use was largely left to the child 
and it was mostly utilized through play separate from scheduled 
and structured activities. 

The widespread use of computer-based technology with young 
children necessitates that any educational software delivers what it 
promises. However, the Center on Media and Child Health claims 
most educational video games have not been scientifically tested 
and advises parents to use their best judgment (CMCH, 2008). It is 
firmly believed that computers can be a valuable asset in preschool 
education, especially as a tool to help children who otherwise 
would be at risk of falling behind once they start school. For
computers to become powerful educational tools, software development
must be informed by educational and developmental research
on young children, and the resulting products must be subjected to 
empirical investigation. 


Advantages of Intervention in Preschool 


Studies of school readiness have reported large individual dif- 
ferences among children with regard to both literacy and numeracy 
skills (Aunio, Hautamäki, Sajaniemi, & Van Luit, 2009; Jordan,
Kaplan, Ramineni, & Locuniak, 2009). To ensure preschool chil- 
dren do not lag behind, it is important to consider ways to support 
children and help them overcome potential risks of starting school 
with an initial disadvantage (Denton & West, 2002; Griffin & 
Case, 1997; Locuniak & Jordan, 2008; Räsänen, Salminen, Wilson,
Aunio, & Dehaene, 2009; Wilson, Dehaene, Dubois, & Fayol,
2009). The majority of children who enter school with early
language and math difficulties are low-performers whose deficiencies
stem from external factors, such as low socioeconomic status
(SES) and low exposure and training at home and at preschool 
(Denton & West, 2002; Jordan, Kaplan, Nabors Olah, & Locuniak, 
2006). Without intervention, these children are likely to remain 
low-performers throughout school (Jordan, Kaplan, Ramineni, & 
Locuniak, 2009; Kendeou, van den Broek, White, & Lynch, 2007; 
Mononen, Aunio, Koponen, & Aro, 2014). However, preschools 
are understaffed in many countries and preschool teachers often 
feel overloaded by what is already required from them in their 
everyday activities (Bullough, Hall-Kenyon, MacKay, & Marshall, 
2014). 

Here educational software harbors a potential with respect both 
to scaling-up and enabling intervention with reasonable time in- 
vestment by teachers. Indeed some educational software can be 
used with little instruction, and teachers may be allowed to focus 
on one group of children while simultaneously being sure that 
another group of children is engaged in fun, meaningful activities 
while learning (Praet & Desoete, 2014). However, returning to a 
previous point, the pedagogic quality of much educational soft- 
ware is low. To benefit young children at preschool the educational 
software that is used must be of high quality as well as be proven 
pedagogically valuable for the age group in question. The study 
presented in this article involves a kind of educational software 
game proven educationally valuable for schoolchildren and inves- 
tigates whether it can also be suitable for younger children. 


Computer-Based Learning-by-Teaching 


Educational benefits from LBT have been known since the early 
eighties through the seminal work of Bargh and Schul (1980). This 
paradigm reverses the roles by letting students become tutors to 
teach their peers. In the present article, an explorative study is 
presented that investigates cognitive prerequisites in preschoolers 
with respect to a digital LBT game developed for this age group. 
The reason for this venture is that the LBT paradigm has demon- 
strated great pedagogical advantages for schoolchildren. Children 
who take the role as tutors show an increase in effort compared 
with when they learn for themselves. The effort is evidenced 
through the children spending more time on learning materials and 
also by them analyzing the material more thoroughly (Bargh & 
Schul, 1980; Martin & Schwartz, 2009). This increased effort 
seems to arise from motivational mechanisms (Benware & Deci, 
1984). Working with learning material to teach others seems to
bring about feelings of responsibility and meaningfulness of the 
task (Bargh & Schul, 1980) leading to positive effects on self- 
efficacy beliefs (Moores, Chang, & Smith, 2006), that is, the belief 
in one’s own competence within a given domain. Self-efficacy 
beliefs in fact turn out to positively correlate with actual accom- 
plishments (Pajares & Graham, 1999). A proposed major factor of 
the benefits of the LBT approach is that it stimulates metacogni- 
tion (Flavell, 1979), in other words, reflective thinking about 
problem-solving and one’s own learning (Schwartz et al., 2009). 

In recent years, digital implementations of the LBT paradigm 
have emerged in the form of educational games involving teachable
agents (TA; Brophy, Biswas, Katzlberger, Bransford, &
Schwartz, 1999). A TA is in essence an artificial intelligence 
algorithm that ensures that the behavior of this digital representa- 
tion of a tutee over time reflects how it is being taught by the 
human student so that the digital tutee indeed appears to learn. This 
form of pedagogical software, in line with research on the traditional
form of LBT, has proven powerful for schoolchildren aged
8 years and upward, both in terms of learning outcomes and 
motivational effects (Biswas, Leelawong, Schwartz, Vye, & The 
Teachable Agents Group at Vanderbilt, 2005; Ogan et al., 2012; 
Pareto, Haake, Lindström, Sjödén, & Gulz, 2012).

This human-to-digital-tutee version of LBT has three unique 
advantages over nondigital LBT: (a) all children can be teachers, 
this includes those that are not naturally inclined to take such a role 
because they either feel less knowledgeable than their peers, or 
because of feelings of low self-efficacy; and (b) the child who 
teaches can automatically be matched with the digital tutee to 
ensure an adequate challenge for each child tutor. To obtain this 
kind of match in human-to-human peer learning is often difficult 
because a large difference in competence between tutee and
tutor results in nonoptimal learning benefits; lastly, (c) no human 
tutee will suffer from a poor tutor, which can occur and be 
experienced as an injustice problem when LBT-inspired pedago- 
gies are used in a group of students. The body of research that 
provides evidence for the educational benefits of the digital LBT 
approach has had a focus on pupils aged between 8 and 14 (Biswas 
et al., 2005; Gulz, Haake, & Silvervarg, 2011; Kim et al., 2006; 
Wagster, Tan, Wu, Biswas, & Schwartz, 2007). Whether the 
benefits of a digital LBT-game can be generalized to preschoolers 
is an open question. In particular, the less developed executive 
functions in preschool children bring about doubt. 

The term “executive functions” is an umbrella term for a multitude
of different cognitive processes that facilitate top-down
control in individuals (Diamond, 2013) and is a vital component of 
school readiness and academic achievement (Blair & Razza, 2007; 
Borella, Carretti, & Pelegrina, 2010; Zaitchik, Iqbal, & Carey, 
2014). The focus of the present study was on top-down guidance 
or control of attention, more specifically sustained attention and 
inhibition. Sustained attention refers to the ability to remain alert 
and maintain attention on the designated task. To enable such 
focus of attention, one has to be able to suppress elements that are 
competing for attention; this is handled by inhibitory processes. 
Several researchers consider inhibition to be a primary executive 
control function (Burgess, Alderman, Evans, Emslie, & Wilson, 
1998; Garavan, Ross, Murphy, Roche, & Stein, 2002; Norman &
Shallice, 2000). 

To fully benefit from LBT software that includes a digital tutee, 
children, in their role as teachers, must be able to pay sufficient 
attention to their tutee’s actions and learning (Okita & Schwartz, 
2013). An adequate level of attention and focus retention requires 
a certain developmental level with respect to executive functions, 
such as attentional and inhibitory capabilities. There is an intense 
developmental period of executive functions during preschool age 
(Perner & Lang, 1999) and this suggests that executive functions 
will not be as well developed in 3- to 6-year-olds as compared with 
8-year-olds. Consequently, an educational game based upon the 
idea that preschool children should teach and instruct—and pay 
close attention to—a digital tutee may not necessarily work out 
well. 

However, a study by Gelman and Meck (1983) showed that
children aged 3-5 were able to detect errors when a puppet 
performed a counting task, even when the numbers exceeded the 
children’s explicit counting range. The study suggested that the 
children have implicit knowledge of numbers exceeding their
apparent count limit, but because of performance demands they
cannot explicate this. By observing someone else counting, the 
children can free up cognitive resources and, therefore, more easily 
reflect upon errors. Thus, this provides good reason for tailoring 
LBT-based games to preschoolers to alleviate cognitive strains. It 
is also important to emphasize that executive abilities are gradually 
developed (Levin et al., 1991; Wellman & Liu, 2004). 

The scientific opinion of young children’s cognitive capabilities 
has repeatedly been revised throughout history. This is usually 
mediated through the introduction of novel methods and tech- 
niques, and more often than not, children turn out to be more 
cognitively able than previously assumed. Surprising results have 
been found in preschoolers’ moral reasoning (Hong, 2003); infants’
appeal to mental states (Onishi & Baillargeon, 2005; Southgate, 
Chevallier, & Csibra, 2010); and young children’s selective atten- 
tion and memory encoding efficacy (Blumberg & Torenberg, 
2003; Markant & Amso, 2014). These results elucidate the fact 
that cognition does not exist in a vacuum. Especially in educational 
environments, skills and abilities emerge through contextual framing
that acts as a scaffold for enhancing cognitive behavior.

Digital learning games can provide this type of contextual 
scaffold as recently shown by Chin, Dohmen, and Schwartz 
(2013). Departing from Piaget’s prevalent claim that 9- to 10-year-olds
are not developmentally mature enough to reason about hierarchical
relations and inheritance in taxonomies, results of their study 
showed that this was only true for traditional learning environ- 
ments. The 9- to 10-year-olds in the study who had an opportunity 
to learn the same content by means of a digital game based on the 
LBT-pedagogy were able to reason about inheritance in taxono- 
mies. A rich and complex digital game targets different levels of 
difficulty as well as different learning goals; therefore, it is
impossible to know before empirical investigation what aspects of a
game can be learnt and mastered given different developmental 
levels. This makes it relevant to empirically investigate to what 
extent 3- to 6-year-olds can have the cognitive prerequisites to 
pedagogically profit from LBT software. 


Distractions in Preschools 


The preschool environment is known to be lively with a plethora 
of visual and auditory distractions. In conjunction with less devel- 
oped executive control in preschoolers, this might become a hin- 
drance in introducing computer-based interventions in preschools. 
Visual distractions have long been known to be detrimental to 
preschoolers’ performance on simple motor tasks (Poyntz, 1933;
Somervill, Hill, White, York, & Hayes, 1978). Computers at 
preschools are normally situated in shared spaces where other 
activities are taking place; game playing might be a shared activity 
or other playing activities might occur around or near the child 
who is interacting with the computer. This implies that distractions 
might be of great concern especially in relation to the use of 
LBT-based games in preschool because players of these games 
need to focus on their digital tutee to be able to reap the benefits 
these games potentially have in store in terms of intervention 
programs in preschool. 


Aim and Research Questions 


Our aim in the present study was to examine preschoolers’
distractibility more closely by bringing an LBT-based educational
game to a preschool. The following two explorative research questions
were formulated:
• Are there preschoolers who can sufficiently focus on their
digital tutee’s actions to inhibit distractions? And if so,
• How do their test scores of executive control differ from
those of preschoolers who cannot?

Pretests to determine the preschoolers’ sustained attention and 
inhibition abilities were administered. Subsequently we studied the 
preschoolers’ inclination to be distracted and lose focus on what 
was central in an LBT-based game from a pedagogical design 
perspective. For this study, distractibility is defined as time spent 
gazing at pedagogically irrelevant elements within a time-limited 
window when focus is needed on parts relevant to the digital 
tutee’s display of problem-solving and learning. Visual distractions
were incorporated into the game in the form of animations to
measure the effects they might have on the participants’ attention.
The rationale for using a game to investigate the preschoolers’ 
level of distractibility is an ecological one with the aim to get the 
experiment design as representative as possible to the actual con- 
text of preschoolers interacting with a teachable agent. 


Method 


Participants 


In total, 65 children (34 girls, 31 boys) aged 3;1 to 6;3 from
a preschool in Southern Sweden were given permission through
written consent forms by their guardians to participate in the 
experiment (70% guardian consent rate). The particular preschool 
was selected because it is situated in a rural area that is represen- 
tative of Sweden with regard to level of education and income 
among its population. In this municipality, 41% of the inhabitants 
have completed higher education compared with 39% of the pop- 
ulation of Sweden. The average income is 298k SEK compared 
with 274k SEK for the average working Swede. We did not 
investigate any variables that might differ between families whose 
children were allowed to participate and families whose children 
were not. Although we cannot exclude the possibility that there
were differences between the groups, it is thought that any such
differences may be attenuated by the nationally very small
differences in SES in Sweden. The preschool houses children from ages
1 to 6 years old and the only criterion for children to participate
was that they had turned 3 years of age. The study was approved by the Regional
Ethical Review Board of Lund (ref. 2013/111). 


Procedures and Measures 


Each child participated alone in two separate data collection 
sessions; one pretest session about 25 min long and a main test 
session about 15 min long. Data collection was carried out over a 
period of 4 weeks in April 2013; 2 weeks of pretest data collection 
and 2 weeks of main test data collection. Thus, there was a gap of 
2 weeks between the two sessions for each participant. Both 
sessions took place in a room at the child’s department of the 
preschool to which the door could be closed to minimize uncon- 
trollable distractions. During the pretest sessions, the participants 
performed one inhibition and one sustained attention pretest task 
and also played the digital LBT-game without any distractive 
animations to familiarize themselves with the game. The rationale
for letting participants get familiar with the game before data
collection of the main task was to make sure that we did not
measure novelty effects. That is, we wanted to make sure that
distractive or attentional behavior was not induced from curiosity 
of the game components themselves. In the main task session, the 
participants played the digital LBT-game with the distracting 
visual stimuli. 

Data collection was carried out by one experimenter who was 
present all through the sessions; no teachers were present during 
the sessions. The experimenter spent one day at the preschool 
before start of the study and was introduced to the children for 
them to feel familiar with the experimenter. The preschool served 
lunch at 11:30 a.m. followed by group reading and relaxation time.
All data collection sessions took place sometime between 10:00–11:30
and 13:00–15:00, and teachers were given the task of asking
a child, who had been given parental consent, whether she or he
would like to participate. Thus, no control was exerted upon time
spacing between the two data collection sessions in favor of the
children’s individual availability and autonomy.

First pretest: Inhibition. To measure the ability to inhibit 
irrelevant visual stimuli, an antisaccade task (Hallett, 1978) em- 
bedded in a narrative to appeal to younger participants was used. 
The antisaccade task is an established method of measuring inhi- 
bition of reflexive motor movements (Antoniades et al., 2013; 
Hutton & Ettinger, 2006; Munoz & Everling, 2004). In this study, 
a narrative for the task was created for it to be more easily 
explained to the target participants; otherwise the procedure
mimicked that of established tests. The task consisted of 24 trials
where two apples were shown on either side of a centered diagonal 
cross on the screen. The participant was instructed to imagine that 
the apples belonged to him or her. A cartoon monster was shown 
to the participant and it was explained that this monster would 
appear and eat one of the apples, and that the only way to save the 
other apple was to look at it and avoid looking at the monster. 
Participants were asked to try and save as many apples as they 
could. This task was a test of the participants’ inhibitory skills of 
reflexive motor movement when presented with visual stimuli, and 
is a way to measure the development of executive control with 
regard to inhibition. The task was presented on a computer screen 
and the children’s eye-movements were tracked using an SMI 
RED remote eye tracker sampling at 250 Hz. 

Children under the age of 8 have trouble suppressing reflexive 
saccades toward moving stimuli (Munoz & Everling, 2004). As 
most of the children were unlikely to pass most of the trials, it was 
not deemed meaningful to measure this task in terms of correct and 
incorrect trials. Instead the measure was calculated by using time 
spent avoiding looking at the monster as a fraction of the monster’s 
display time. 
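
The time-fraction measure described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each trial provides the valid gaze samples recorded while the monster was displayed, and that the monster occupies a rectangular area of interest. The function names, the rectangle, and the sample values are all hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical AOI format: (left, top, right, bottom) in screen pixels.
Rect = Tuple[float, float, float, float]


def in_rect(x: float, y: float, rect: Rect) -> bool:
    """Return True if the gaze point lies inside the rectangle."""
    left, top, right, bottom = rect
    return left <= x <= right and top <= y <= bottom


def inhibition_fraction(gaze: Sequence[Tuple[float, float]],
                        monster_aoi: Rect) -> float:
    """Share of the monster's display time spent NOT looking at the monster.

    `gaze` holds the samples recorded while the monster was shown. With a
    fixed sampling rate, the share of samples equals the share of time.
    """
    if not gaze:
        raise ValueError("no gaze samples for this trial")
    away = sum(1 for x, y in gaze if not in_rect(x, y, monster_aoi))
    return away / len(gaze)


# Hypothetical trial: 10 samples (40 ms at 250 Hz), 4 of them on the monster.
monster = (100.0, 300.0, 300.0, 500.0)
samples = [(200.0, 400.0)] * 4 + [(900.0, 400.0)] * 6
print(inhibition_fraction(samples, monster))  # 0.6
```

A per-participant score would then be an aggregate of this fraction over the 24 trials; the section does not state whether a mean or a pooled fraction was used, so that choice is left open here.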

Second pretest: Sustained attention. A traditional go-no-go 
paradigm task (Groot, de Sonneville, Stins, & Boomsma, 2004; 
Robertson, Manly, Andrade, Baddeley, & Yiend, 1997) was ad- 
opted to measure sustained attention. The stimulus was presented 
on a computer screen and an external keyboard was used to capture 
the participant’s response. Five colors were quasi-randomly dis- 
played 15 times each. Each color was displayed for 500 ms and 
separated by a 100 ms mask. The participant was asked to press the 
spacebar of the keyboard each time a new color was shown on the 
screen (60 go-trials) except for when the color was blue (15
no-go-trials). Before beginning the task, all colors one at a time
were displayed to the participants and they were asked to name
them to make sure that participants were familiar with the colors
and that they did not have any color vision deficiencies that could
disrupt performance. All participants correctly identified the
colors. The participants were also given a test run of 15 trials after
which the task began.

Figure 1. Screen shots of the learning by teaching (LBT) game Bird
Hero. Top: A picture of the tree with the elevator going up to the
bird’s nest. Bottom: The four Game Modes of the game.

From the total 75 trials, each participant’s final score was 
recorded as (a) hits, that is, the number of times a participant 
withheld pressing the space bar key when the color blue was 
presented; (b) misses, that is, the number of times a participant 
pressed the space bar key when the color blue was presented; (c) 
correct rejections, that is, the number of times a participant pressed 
the space bar key when any other color than blue was presented; 
and (d) false alarms, that is, the number of times a participant 
withheld pressing the space bar key when any other color than blue 
was presented. With these scores, a signal-detection sensitivity
index, log d, was calculated (Davison & Tustin, 1978). Participants
will have an innate tendency toward being either response
prone or response aversive that will lead to a biased measure if
only hits are used. The calculated measure of log d is a means to
handle this response bias, and was used as the value for Sustained
Attention during the analyses. Generally, d′ is calculated to
handle response bias. However, log d is recommended for
tests of fewer than 100 trials (Brown & White, 2005) because
d′ has a tendency to be positively biased for tests with a low
number of trials (Kadlec, 1999). To handle extreme discriminability
(i.e., a participant managing to score 100% on either go or
no-go trials), Brown and White’s (2005) recommendation of
adding a constant (.5 in this case) to hits, misses, correct
rejections, and false alarms was adopted.
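
The scoring and the sensitivity index can be sketched as below. The section does not print the formula, so this sketch assumes the commonly cited form log d = 0.5 × log10[(hits × correct rejections) / (misses × false alarms)], with the constant .5 added to every cell as described. The function names and the example counts are hypothetical; the outcome labels follow the definitions given in this section (a hit is a withheld response to blue).

```python
import math


def score_go_no_go(trials):
    """Count outcomes from (is_blue, pressed) pairs, using this section's labels."""
    counts = {"hits": 0, "misses": 0, "correct_rejections": 0, "false_alarms": 0}
    for is_blue, pressed in trials:
        if is_blue:
            # Blue is the no-go color: withholding is a hit, pressing is a miss.
            counts["misses" if pressed else "hits"] += 1
        else:
            # Any other color is a go trial: pressing is a correct rejection,
            # withholding is a false alarm (labels as defined in the text).
            counts["correct_rejections" if pressed else "false_alarms"] += 1
    return counts


def log_d(hits, misses, correct_rejections, false_alarms, constant=0.5):
    """Assumed log d formula, with a constant added to each cell.

    The added constant keeps the ratio finite when a participant scores
    100% on either go or no-go trials.
    """
    h, m, cr, fa = (v + constant
                    for v in (hits, misses, correct_rejections, false_alarms))
    return 0.5 * math.log10((h * cr) / (m * fa))


# Hypothetical participant: 15 blue trials and 60 trials of other colors.
trials = ([(True, False)] * 12 + [(True, True)] * 3
          + [(False, True)] * 50 + [(False, False)] * 10)
counts = score_go_no_go(trials)
print(counts)
print(log_d(**counts))
```

A value of 0 indicates no discrimination between blue and the other colors, and larger values indicate better discrimination.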

Main test: LBT-game with visually distracting stimuli. The 
main task consisted of the participants playing the digital LBT-game
Bird Hero, developed in JavaScript and HTML5 by Anderberg,
Axelsson, Bengtsson, Hakansson, and Lindberg (2013).
The game narrative revolves around a flock of chicks that are 
blown out of their nests and need help to get back. The child helps 
the chicks return home via a lift by pushing lift buttons (see screen
shots in Figure 1). When a chick presents the number of feathers
representing the floor it lives on, the child’s task is to match this 
number with one of eight lift buttons presented at the bottom of the 
computer screen. The game consists of four game modes: (a) child 
plays, (b) TA watches while child plays, (c) child guides TA who 
tries to play, and (d) child watches TA play on his own. The four 
game modes are depicted in the bottom part of Figure 1. First, in 
Game Mode 1, the child alone helped the chicks to the correct 
branch by maneuvering the lift panel. In Game Mode 2, the TA in 
the form of a panda introduced himself and asked whether he could 
watch to later on be able to help some birds himself. In Game 
Mode 3, the TA suggested which lift button should be pressed by
presenting his choice in a thought bubble. The participant decided 
whether the suggestion was correct or incorrect through a binary 
choice by pressing a green tick or a red cross, respectively. These 
binary buttons were presented centered at the bottom of the computer
screen and the TA’s thought bubble did not disappear until the
participant pressed one of these binary buttons. In Game Mode 4, 
the TA played without any help from the participant. Participants 
wore headphones during game play to be able to listen to the TA 
and the birds. 

It is important to emphasize that these game modes are not an 
experimental manipulation but a concept crucial to LBT-based 
games. The Bird Hero game was developed to simulate a fully 
working LBT-game but in a “Wizard-of-Oz”-type implementation, 
that is, without any advanced artificial intelligence incorporated, 
because we are not investigating learning effects in this particular 
study but instead how young children behave with a TA. This is 
thus, as has been expressed above, an ecological rationale. 

Distractibility manipulation. Throughout the game, three
different distracting visual stimuli were used in the form of
animations that were irrelevant to game play (see Figure 2): (a) a
football rolling across the grass in front of the TA and the bird, (b)
an aero plane passing by in the background, and (c) a flickering
square, symbolizing a program glitch. These animations were
introduced experimentally to approximate the effects of a noisy
environment with task-irrelevant stimuli under controlled
circumstances to measure their influence on children’s attention. Because
the task-irrelevant animations were condensed into an eye-trackable
area, they provided a possibility for measuring distractibility.
The aim was to investigate which participants were able to
inhibit these stimuli and focus their visual attention on the task at
hand. The distractive animations were played in Game Modes 3 and
4 at crucial parts of game play when the child, to pedagogically
profit from the game, would have to concentrate on the TA.

Figure 2. Screen shots of the three visually distracting animations:
Football animation during Game Mode 3 (left), glitch (middle), and
aero plane (right) animations during Game Mode 4.

In Game Mode 3, where the TA suggests which button to 
press and the child accepts or rejects the suggestion, the football 
rolled past once on the lower part of the screen as the TA
presented his suggestion in the thought bubble in one of the 
game rounds. This animation played for 3 s. The animation was 
played back when the TA made an action that the participant 
should attend to. More important, in Game Mode 3 the partic- 
ipant is in control of the game and can look at the thought 
bubble any time after the distracting animation has finished. 
The distraction in Game Mode 3 serves the purpose of giving a 
more general view of how distractions affect the participants by 
means of comparing two game rounds where the distracting 
animation is either present or absent. 

In Game Mode 4, in which the child only observed the TA 
playing but was not able to act herself, the glitch flickered in the 
top left corner of the screen just as the TA made his choice on the 
first round (out of two). After the TA had made his choice, the aero 
plane flew past diagonally, entering the top left corner of the 
screen. For the second round, the same two animations were 
played but in reversed order (i.e., aero plane during the TA’s choice
and glitch after the TA’s choice). These animations played for 2 s
each. 

The way the TA made his choice was by moving his hand 
horizontally, from left to right, along the eight lift buttons at the 
bottom of the computer screen. Once he reached the end of the 
screen, he moved his hand back from right to left and made his 
selection. His hand then continued all the way to the left and the 
hand moved horizontally once more from left to right and back 
again and exited the screen on the far left. The TA’s hand movement
across the screen took 2 s. The reason why the TA moves his
hand along the lift buttons twice is so that when the two
counterbalanced animations are played (during and after the TA’s
choice), the TA’s hand is situated at the same spot to make the two
conditions as visually similar as possible with the only difference
that a lift button is up or down depending on whether it has been
pressed by the TA or not. The animations were played 1 s before
the TA reached the button he was meant to press (during the TA’s
choice) or had recently pressed (after the TA’s choice). This is a
time-limited situation where the TA is in charge of the game and
the child can either attend to the TA’s actions or to the distracting
animations, but not both. An SMI RED remote eye tracker sampling
at 250 Hz was used throughout game play.

Because the game holds many moving elements that trigger
smooth-pursuit eye movements, fixations could not be reliably
detected. Instead, we used gaze proportions of, or accumulated 
gaze time on, areas of interest (AOIs) calculated from the raw 
sample data. The AOIs were defined as Bird, Lift Buttons or 
Binary Buttons, Distraction, and TA or TA Hand. The gaze time 
spent on the distractive animations was used as a measure of 
distractibility. Comparison between animations before and after 
the TA’s choice in Game Mode 4 gave an indication of whether
children are less inclined to be distracted when the TA displays his 
learnt ability compared to when nothing interesting from a peda- 
gogical perspective is happening on the screen (Research Question 
1). The distractibility measure during the TA choice was used in 
analysis together with the pretest measures to answer whether 
measures of executive function can predict distractibility behavior 
(Research Question 2). 
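
Accumulated gaze time per AOI can be derived from raw samples roughly as sketched below. This is an illustration under stated assumptions, not the study's pipeline: AOIs are treated as static rectangles within one analysis window (in the actual game several elements move), and the AOI coordinates and gaze samples are invented. Only the 250 Hz sampling rate and the AOI names come from the text.

```python
from collections import Counter

SAMPLE_MS = 1000 / 250  # one raw sample covers 4 ms at 250 Hz

# Hypothetical AOI rectangles: (left, top, right, bottom) in screen pixels.
AOIS = {
    "Bird": (600, 300, 800, 500),
    "Lift Buttons": (200, 900, 1000, 1000),
    "Distraction": (0, 0, 200, 200),
    "TA Hand": (200, 700, 1000, 899),
}


def classify(x, y, aois):
    """Return the name of the first AOI containing the point, else None."""
    for name, (left, top, right, bottom) in aois.items():
        if left <= x <= right and top <= y <= bottom:
            return name
    return None


def gaze_time_ms(samples, aois):
    """Accumulated gaze time per AOI, in milliseconds."""
    counts = Counter(classify(x, y, aois) for x, y in samples)
    return {name: counts.get(name, 0) * SAMPLE_MS for name in aois}


def gaze_proportions(samples, aois):
    """Share of all samples in the window that fall on each AOI."""
    times = gaze_time_ms(samples, aois)
    window_ms = len(samples) * SAMPLE_MS
    return {name: t / window_ms for name, t in times.items()}


# Hypothetical 2 s window (500 samples): 50 samples on the distraction,
# 400 on the TA's hand, and 50 outside every AOI.
window = [(100, 100)] * 50 + [(500, 800)] * 400 + [(1500, 50)] * 50
times = gaze_time_ms(window, AOIS)
print(times["Distraction"])  # 200.0
print(gaze_proportions(window, AOIS)["TA Hand"])  # 0.8
```

Working from sample counts rather than detected fixations matches the rationale given above: smooth-pursuit movements make fixation detection unreliable, whereas counting samples per AOI does not depend on an event-detection algorithm.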


Results 


Of the original 65 participants, 36 were part of the analysis (20 
girls, 16 boys; Mage = 5;2; SD = 9 months). The large attrition
was due to three reasons: (a) for natural reasons, a large part
of the participants were not at all familiar with numbers and could 
therefore not participate in the main task (18 participants; 
Mage = 4;1); (b) a few participants were reluctant to complete all 
pretests (7 participants); and (c) the eye tracking data were too 
poor for some participants in the main or pretests (4 participants). 
Statistical analysis was performed using the statistical program- 
ming language R (v.2.15.1). 


Pretests Analyses 


Table 1
Descriptive Statistics of the Study Variables

Variable                       M        SD            Min     Max
Age (years)                    5.15     .74           3.76    6.25
Distractibility (ms)           198.44   260.92        0       844
Inhibition (%)                 60.15    16.43         19.04   89.59
Sustained Attention (log d)    .46      [illegible]   .09     .97
Hits                           8.89     3.02          1       13
Misses                         6.11     3.02          2       14
Correct Rejections             48.94    8.15          22      60
False Alarms                   11.06    8.15          0       38

The means, SDs, and minimum and maximum values of the two pretest
measures, together with age and the Distractibility measure, are
summarized in Table 1. As expected, the participating preschoolers did
not perform well on the inhibition task, which is in line with previous
research (Fukushima, Hatta, & Fukushima, 2000). On average, the
participants managed to completely inhibit the distraction on nine of
the 24 trials in this task. However, the inhibition time fraction
measure described earlier revealed differences across age. A
statistically significant positive correlation was found between age
and the Inhibition measure (r = 0.45), whereas the correlation between
age and the Sustained Attention measure, though positive, was weak
(r = 0.28). This analysis suggests that the older a participant was,
the better she or he performed on the pretest tasks. A weak positive
correlation was also found between the two pretests (r = 0.29).
Student’s t tests did not reveal any statistically significant
difference between genders with regard to Sustained Attention
(t = 0.02; df = 34; p = 0.98) or Inhibition (t = −0.35; df = 34;
p = 0.73).
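As a rough check on which of the correlations above reach significance with 36 participants, a Pearson correlation can be converted to a t statistic with n − 2 degrees of freedom. The sketch below is illustrative only: it assumes two-tailed tests at α = .05, hard-codes the critical value for df = 34 from a standard t table, and uses a function name that is not taken from the study’s R scripts.

```python
import math

def t_from_r(r: float, n: int) -> float:
    """t statistic for a Pearson correlation r based on n observations (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

N = 36          # participants included in the analysis
T_CRIT = 2.032  # assumed: two-tailed critical t for df = 34 at alpha = .05 (t table)

correlations = {
    "age x Inhibition": 0.45,
    "age x Sustained Attention": 0.28,
    "Inhibition x Sustained Attention": 0.29,
}

for label, r in correlations.items():
    t = t_from_r(r, N)
    print(f"{label}: r = {r:.2f}, t({N - 2}) = {t:.2f}, "
          f"exceeds critical value: {abs(t) > T_CRIT}")
```

Under these assumptions, only the correlation between age and Inhibition exceeds the critical value, which agrees with the description of the other two correlations as weak.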


Distractibility Analysis 


The graphs of Figure 3 show two similar time windows of the 
game—just when the TA presents his choice in a thought bubble 
of Game Mode 3—where the difference is that the football ani- 
mation was played as a distraction in the second time window 
(Figure 3B). In both time windows, gaze proportions are averaged 
over the 36 participants. Figure 3C represents a difference graph 
between the two time windows. On average, the participants spent 
994 ms (SE = 125 ms) of the total 3 s animation playback time
looking at the distraction (33%). 

The graphs of Figure 4 show the two time windows during (4A)
and after (4B) the TA’s choice in Game Mode 4. Gaze proportions
are averaged over the 36 participants and cover the TA helping
two birds. The majority of the participants (20 of the 36) did not
attend to the distracting animations at all during the TA’s choice,
and only 2 participants attended to both of the animations
played during the TA’s choice. In Game Mode 4, the average time
for which the participants gazed at the distractions during and after
the TA’s choice was 198 ms (SE = 43 ms; 9.9% of screen time)
and 581 ms (SE = 82 ms; 29% of screen time), respectively.
Because many participants were not distracted, the data were
skewed and the Distractibility measure had a zero-inflated
distribution. To handle this, the Distractibility measure was
converted to a dichotomous variable in which those who were distracted
(Distractibility > 0 ms) were assigned a 1 and those who were not
distracted (Distractibility = 0 ms) were assigned a 0. A Yates’ χ²
test revealed a statistically significant difference in attention to the
distractions between the two time windows, during (16 distracted, 20
nondistracted) and after (30 distracted, 6 nondistracted) the TA
choice (χ² = 10.17; df = 1; p < .001).
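The two descriptive steps in this paragraph, deriving a standard error from a standard deviation and recoding gaze time as a dichotomous variable, can be sketched as follows. The function names and the five example gaze times are hypothetical; only the SD of 260.92 ms (Table 1) and the sample size of 36 come from the study.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean from a sample SD and sample size."""
    return sd / math.sqrt(n)

def dichotomize(gaze_ms: list[float]) -> list[int]:
    """Code each participant as 1 (any gaze on the distraction) or 0 (none)."""
    return [1 if ms > 0 else 0 for ms in gaze_ms]

# SD of Distractibility during the TA's choice (Table 1) and analysed sample size
se_during = standard_error(sd=260.92, n=36)
print(round(se_during, 1))  # 43.5

# Hypothetical gaze times (ms) for five children, for illustration only
example_gaze = [0.0, 0.0, 120.5, 844.0, 0.0]
print(dichotomize(example_gaze))  # [0, 0, 1, 1, 0]
```

The computed standard error of about 43.5 ms is consistent with the 43 ms reported for the window during the TA’s choice.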

Student’s t tests were carried out to investigate whether those
who were distracted during the TA choice differed from those
who were not. These revealed no statistically
significant differences between the two groups with regard to
age or performance on the Inhibition and Sustained Attention
pretests. However, the majority (15 of 20) of the participants
who were not distracted during the TA choice were female, which
resulted in a statistically significant Yates’ χ² test between genders
(χ² = 5.23; df = 1; p < 0.05).
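Both Yates-corrected statistics reported above can be recomputed from the counts given in the text. The sketch below uses the textbook formula for a 2 × 2 table with continuity correction; it is not the authors’ R code, and the function name is illustrative. The gender table is derived here from the reported figures (15 of the 20 nondistracted children were girls, in a sample of 20 girls and 16 boys).

```python
def yates_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Yates-corrected chi-square (df = 1) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * corrected ** 2 / denominator

# Rows: during vs. after the TA's choice; columns: distracted, not distracted
windows = yates_chi_square(16, 20, 30, 6)

# Rows: not distracted vs. distracted during the TA's choice; columns: girls, boys.
# Derived cells: 20 - 15 = 5 nondistracted boys, 20 - 15 = 5 distracted girls,
# 16 - 5 = 11 distracted boys.
gender = yates_chi_square(15, 5, 5, 11)

print(f"{windows:.2f} {gender:.2f}")  # 10.17 5.23
```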

Figure 3. Gaze proportion in two similar time windows of four areas of
interest over time with (A) and without (B) the football distraction in
Game Mode 3. Graph C shows the resulting difference from gaze
proportions of Graph A subtracted from those of Graph B. Duration is the
length of the football animation distraction, and 0 on the x-axis
denotes distraction onset. TA = teachable agent.

Figure 4. Gaze proportion of four areas of interest over time during (A)
and after (B) teachable agent (TA) choice in Game Mode 4. The time
duration is the length of the glitch/aeroplane animation distractions,
and 0 on the x-axis denotes distraction onset.

We used a logistic regression to analyze which pretest and
participant variables could predict whether or not a child was
distracted by our manipulation. The dichotomous Distractibility
measure was regressed on the two pretest measures. Age
and gender were also included in the analysis because age seemed to
correlate with the pretests and gender was revealed to have an
impact on gaze behavior. This analysis revealed statistically
significant main effects of Sustained Attention and gender on
Distractibility (see Table 2). The results suggested that, when
controlling for age, inhibition, and sustained attention, girls were
less likely than boys to be distracted by our manipulations
(approximately one girl for every nine boys; bfemale = −2.194;
odds ratio = 0.111; p < .01).
Moreover, higher Sustained Attention pretest scores were
associated with higher odds of being distracted; for every tenth of
a log d increase in the Sustained Attention sensitivity index, the
odds of being distracted increase ninefold on average across
the range (b = 4.525; odds ratio = 92; p < 0.05).
The pseudo R² for the model was 0.254, which is within the range
(0.2–0.4) of a good model fit (McFadden, 1973). A second
logistic regression analysis was carried out including
interactions between all predictor variables of the first model. No
significant interaction effects were found.
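The odds ratios reported for the logistic regression are the exponentiated coefficients, since the coefficients are on the log-odds scale. The following minimal sketch recomputes them from the two significant coefficients quoted in the text; small differences from the published odds ratios are to be expected because the coefficients are rounded to three decimals.

```python
import math

# Coefficients as reported for the logistic regression (log-odds scale)
coefficients = {
    "Gender (F)": -2.194,
    "Sustained Attention (log d)": 4.525,
}

for name, b in coefficients.items():
    odds_ratio = math.exp(b)
    print(f"{name}: b = {b:+.3f}, odds ratio = {odds_ratio:.3f}")

# An odds ratio of about 0.111 for girls corresponds to odds of roughly 1 to 9
girls_to_boys = round(1 / math.exp(coefficients["Gender (F)"]))
print(girls_to_boys)  # 9
```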


Discussion 


The pedagogical power of TAs in learning environments, as a
digital version of an LBT approach, has repeatedly been shown for
students aged 8 to 14. In this study we explored the possibility of
introducing this kind of pedagogical software in preschool as well.
Because of the developmental stage of their executive functions,
it may be argued that children this young are not cognitively able
to benefit from such educational games. Furthermore, a preschool
is a lively environment, which adds to the doubt about
whether these proposed intervention games would be suitable
there. This study addressed two questions: (a) whether there are
preschoolers who can inhibit distractions to pay attention to a TA,
and (b) whether experimental measures of inhibition and sustained
attention can predict distractibility in these preschoolers.

Table 2
Logistic Regression Estimates for Distractibility

Predictor                      b         SE      p      Odds ratio
Intercept                      −1.260    1.736   .468   —
Age (centered at M = 5.15)     −.460     .636    .470   .632
Gender (F)                     −2.194*   .855    .010   .111
Inhibition                     .158      2.821   .955   1.171
Sustained Attention (log d)    4.525     2.248   .044   92.248

*p < 0.05.

As can be seen in the graphs in Figure 3, the distracting
football animation captures a good deal of the participants’ visual
attention when the participants are in charge of the
game. The difference graph (Figure 3C) makes it evident that
the distraction steals equal amounts of attention from the more
relevant areas of interest. This can be contrasted with the
graphs in Figure 4, which represent gaze proportions on AOIs during
(4A) and after (4B) the TA makes his choice in Game Mode 4.
During the TA’s choice, the gaze proportions on the distracting
animations drop dramatically.

These results, along with the distractibility analysis presented
above, show that this group of preschoolers seems very able to inhibit
distractions in order to focus attention on their digital tutee. The
participants in this study were in fact so good at this that the
majority did not look at the visual distractions at all when the TA
was choosing between numbers. However, after the TA had made his
choice, the participants were once again visually occupied by the
distractions, as indicated by the graph in Figure 4B.

The results suggest that, although everything else was kept constant
between the two conditions, the children were more distracted
after the TA had made his choice than they were during his
choice selection. Of particular interest, the participants in
this study did not perform well on the inhibition pretest but
nonetheless managed to inhibit distractions during the main task.
This shows the relevance of context and motivation in
empirical investigations of cognitive capabilities. We suggest
that the children’s attentional behavior is scaffolded by the
context (i.e., engagement in play-and-learn software); thus, they
performed better in terms of inhibiting distractions than in the
context of a standard inhibition test.

By inhibiting distractions, participating preschoolers could
increase their attention on more important features of the game. As
is shown in Graph A of Figure 4, the preschoolers do focus more
of their visual attention on the TA’s hand and the lift buttons, one
of which their tutee is about to press, and less on the bird and the
distraction, which are less important for benefiting from the game. It
is particularly interesting that the preschoolers keep such focus
even though they cannot themselves be active in the game in this
mode (Game Mode 4); they can only observe their tutee’s actions.
This result corroborates the findings of a pilot study by
Axelsson, Anderberg, and Haake (2013), who found that
preschoolers seem to pay attention to their TA. It also places
preschoolers together with primary schoolchildren in this respect.
Lindström et al. (2011) showed that primary schoolchildren paid
close attention to their digital tutee while the tutee was acting on
its own. In contrast to the preschoolers, however, the primary
schoolchildren also often showed high engagement in this
situation.

The present study also found similar results to those of Roderer, 
Krebs, Schmid, and Roebers (2012) with regard to distractibility 
and engagement. In their study of selective encoding for learning, 
they found that preschoolers were able to increase attention toward 
relevant stimuli and inhibit task-irrelevant stimuli showing en- 
gagement in task-oriented behavior. Roderer et al. (2012) used 
fairly simple and mainly static information in their study and 
concluded that their results were potentially dependent upon their 
operationalization. However, the results of the present study suggest
that preschoolers are able to increase attention toward relevant
stimuli also in TA-based learning environments that are more visually
complex and narratively elaborated. Hence, these studies together
show that preschool children are not as susceptible to visual
distractions as one might believe, which further suggests that children
are able to filter out distractions when their interest and focus lie
elsewhere.

In regard to the second research question, the results showed
that the measure of sustained attention appears to be a predictor of
distractibility. However, the direction of the effect is the reverse
of what one would expect: participants who performed well on the
sustained attention task were more distracted during the TA’s choice.
This result was surprising. One possible explanation could be related to
the lack of inhibition. The children who participated in this study 
were shown to have poor inhibitory skills as suggested by the 
results of the antisaccade task. Thus, it seems that the Sustained 
Attention measure captured some other aspect of attention in these 
participants—in relation to the distractibility measure—since mo- 
tor inhibition is required also for the go-no-go paradigm task used. 
Our interpretation of the result is that the measure seems to have 
been related to the children’s more general attentional abilities, 
that is, their tendency to notice changes in their environment. That 
would mean that a child who is well able to detect whenever the 
screen color is blue in the sustained attention task will also be more 
likely to notice the visual distractions. This could suggest that
overall attention to changes in the environment also leads to being
more distracted unless inhibitory capabilities have matured. Thus,
when it comes to the participants who were not distracted at all,
another factor must account for their being able to inhibit—or
more likely filter out—the distractions.

An unanticipated result was that the female participants were 
less likely to attend to the distracting visual stimuli. Similar 
results have however been found in previous studies where boys 
have been found to score higher on distractibility measures 
(Bridges, 1929; Victor & Halverson, 1975). Poyntz (1933) 
found that even though boys responded to distractions more 
frequently, they did not spend more time being distracted than 
girls. Although results in the present study conversely showed 
that boys were on average more distracted, it is important to 
emphasize that the overall mean time for attention to the dis- 
tracting visual stimuli during the TA choice was less than 200 
ms (10% of distraction screen time). Thus, even if a participant 
was distracted, regardless of sex, he or she was not distracted 
for long and quickly returned his or her attention to the TA. The
results of a recent study of metacognitive reasoning in pre- 
schoolers showed that girls were more inclined to play another 
round with a TA when asked than were boys (Haake, Axelsson, 
Clausen-Bruun, & Gulz, 2015). A previous study (Robertson, 
Cross, Macleod, & Wiemer-Hastings, 2004)—including 60 
somewhat older children (10-12 years old) who got to use an 
educational software support in either a TA or a non-TA ver- 
sion—showed that girls tended to interact more in the TA 
version than the non-TA version, whereas the pattern was 
reversed for the boys. Thus, there might be some motivational
aspects to the digital LBT concept in general that allow girls to
be slightly more focused and engaged than boys.

Results from the present study suggest that the preschool age is 
the point where important cognitive capabilities for benefitting 
from the use of LBT games are forming. These capabilities are 
fairly heterogeneous in this young age group. However, the LBT 
game context has through this study been shown to be of practical 
use for scaffolding mature behavior for some preschoolers com- 
pared to what abstract behavioral tests would suggest. Theoreti- 
cally, this implies that children with underdeveloped inhibitory 
skills might still be able to attend to LBT-based software. 


Study Limitations 


Sample size. The large participant attrition in this study was
not anticipated. Working with a young target population is difficult
and requires a large participant margin, as does working with
eye tracking, where calibration difficulties and tracking loss make
it hard to obtain reliable data. Moreover, listwise deletion of
participants unfamiliar with numbers had to be
used. As a result, the findings of the present study may be
relevant only to children at the higher end of the skill spectrum of
executive control. In future LBT studies with this age group, the
number of participants needs to be increased. Furthermore, if a 
familiarization with numbers is required, the attrition of partici- 
pants has to be estimated and accounted for to ensure strong 
statistical power. Smaller studies could consider increasing the 
minimum age to handle attrition but this will limit the generaliz- 
ability of results. 

Learning effects. Our research in the present study has been
guided by the question of whether preschoolers can profit from
LBT-based games rather than whether they do profit. This limits
what we can say about learning effects for
preschoolers playing LBT-based games—in this case with respect
to number sense and early math. However, preliminary results
from a follow-up study indicate that the LBT-based
play-and-learn game used in this study seems to have a positive impact
in terms of early math learning gains (Gulz, Londos, & Haake,
2015).

Limited SES range. Because SES levels vary little within the
Swedish population and are fairly high, our study has little to say
about whether the cognitive prerequisites needed for LBT-based games
are sufficiently developed in children brought up in lower SES
circumstances. Replications of this study with children in low-SES
areas, as well as cross-cultural studies, would be needed to draw any
such conclusions.


Future Research 


From the results of the present study, it seems reasonable to 
pursue research and development with respect to educational 
LBT-based software for preschoolers. The results of the study 
open up several future research lines. The results indicated that
girls might benefit more from this pedagogical form, and
whether this is true must be investigated further. In any case,
the display of mature cognitive behavior of some of the pre- 
schoolers in this study shows great potential for the develop- 
ment of educational tools for exercising and training of pre- 
schoolers’ metacognitive reasoning. 

The software developed in the work of this study will be utilized 
as a research instrument in combination with other methods in 
future investigations. One objective is to find out to what extent 3- 
to 5-year-olds feel responsible for their tutee and at what stage the 
ego-protective buffer—that is, the sharing of responsibility for 
mistakes and errors by attributing them partly to the tutee and 
partly to oneself—comes into play (Chase, Chin, Oppezzo, & 
Schwartz, 2009). Another future objective is to investigate whether 
it is possible to further the development of theory of mind and 
metacognition in preschoolers through the use of emotional dis- 
play in TAs. 


Conclusions 


The present study shows that the paradigm of learning by
teaching, implemented in teachable-agent-based educational
games, could possibly be used with much younger children than
one would have thought, since some of the participants in the
present study possessed the prerequisites for benefiting from
LBT-based games. Three- to 6-year-old children who do not have
mature skills at inhibiting attention to distractions can nonetheless
do so when paying attention to a digital tutee they are responsible
for helping. This shows that the context or task (the latter always
partly defined by context or nature of the activity) influences the
attentional skills of these young learners.

In conclusion, though the study has some obvious
limitations that affect the generalizability of its results,
it does show that there are at least some young children who have the
cognitive prerequisites to play learning-by-teaching-based
games. Even if not all children are able to play these games,
they can be made available as soon as the child is ready.
Furthermore, software games have the great potential of being
individually customizable for a broader audience compared with
conventional teaching methods.
Executive Functions as Moderators of the Worked Example Effect: When
Shifting Is More Important Than Working Memory Capacity 


Matthias Schwaighofer, Markus Bühner, and Frank Fischer


Ludwig-Maximilians-Universität München


Worked examples have proven to be effective for knowledge acquisition compared with problem solving, 
particularly when prior knowledge is low (e.g., Kalyuga, 2007). However, in addition to prior knowledge, 
executive functions and fluid intelligence might be potential moderators of the effectiveness of worked 
examples. The present study examines the roles of the executive functions of working memory capacity 
and shifting, as well as the role of fluid intelligence for knowledge acquisition in the presence or absence 
of worked examples. Seventy-six university students learned to solve statistical problems either with 
worked examples or through problem solving (the absence of worked examples). Results showed that 
shifting and fluid intelligence, but not prior knowledge and working memory capacity, moderated the 
effect of the presence of worked examples on knowledge acquisition. The higher the shifting ability and 
fluid intelligence were, the lower was the benefit of worked examples compared with problem solving. 
Learning environments did not differ with respect to cognitive load, and cognitive load was not correlated 
with working memory capacity, but it was correlated with fluid intelligence. These findings suggest that 
other important cognitive functions, such as shifting and fluid intelligence, might be more important than 
prior knowledge or working memory when worked examples are compared with problem solving. Future 
research can further examine whether the relative contribution of the different functions is likely to 
depend on the characteristics of the respective tasks—that is, whether a task puts a high or low demand
on these cognitive functions.


Keywords: executive functions, fluid intelligence, worked examples, knowledge acquisition 


Worked examples have been shown to be more effective than 
unguided problem solving in learning mathematics (Carroll, 1994) 
and statistics (Leppink, Paas, Van Gog, van der Vleuten, & Van 
Merriënboer, 2014; Paas, 1992). A worked example consists of a
problem formulation, some solution steps, and a final solution 
(Schwonke et al., 2009, p. 259). Some factors have been proposed 
to moderate the effectiveness of the presence of worked examples 
on knowledge acquisition. This article addresses the question of 
whether executive functions and fluid intelligence, as largely un- 
considered factors, in addition to prior knowledge, have a moder- 
ating influence on learning. After an introduction to research on 
worked examples, prior knowledge and executive functions are 
presented as potential moderators of the effectiveness of worked 
examples. 


Effects of the Presence of Worked Examples on 
Knowledge Acquisition 


Studies that investigated the effectiveness of worked examples 
typically have measured learning outcomes with tests assessing 
different kinds of knowledge, depending on the content to be
learned. In the domain of statistics (Leppink et al., 2014; Paas, 
1992) and other domains such as electrical circuits troubleshooting 
(van Gog, Paas, & Van Merriënboer, 2006) or mathematics (Carroll,
1994), worked examples have been shown to be beneficial
with respect to the acquisition of knowledge that can be applied to 
new problems. For example, in the study by Paas (1992), partic- 
ipants had to apply knowledge about measures of central tendency 
to given problems. In a typical problem, participants were required 
to identify the important numbers to compute a mean level knowl- 
edge of statistics. Computation required students to apply their 
individual levels of knowledge. Worked examples effectively fos- 
tered the acquisition of knowledge. Students were able to apply 
this knowledge to problems of near, and even far, transfer (Paas, 
1992). From a theoretical perspective, this seems plausible because 
the solution steps presented in a worked example delineate impor- 
tant aspects of a problem and how to apply appropriate knowledge 
to solve the problem. In other words, worked examples are 
particularly useful to acquire knowledge to identify relevant infor- 
mation of a given problem and knowledge that can be applied to 
reach a problem solution. Thus, worked examples can be considered
effective scaffolds for the acquisition of application-oriented
knowledge. We use the term application-oriented knowledge
to describe the knowledge necessary to identify relevant
aspects of a problem as well as knowledge that is applicable to
solving the problem.

Notably, beyond application-oriented knowledge, studies on 
worked examples indicated that acquisition of conceptual knowl- 
edge also can be fostered through this instructional measure 
(Schwonke et al., 2009, 2007). Conceptual knowledge can be
characterized as a network in which information elements are 
interconnected rather than isolated (Hiebert & Lefevre, 1984). 
Conceptual knowledge includes “knowledge about facts, concepts, 
and principles” (de Jong & Ferguson-Hessler, 1996, p. 107). In 
contrast to application-oriented knowledge, conceptual knowledge 
cannot necessarily be used directly to identify relevant aspects of 
a problem and cannot necessarily be applied to solve a problem. 
With respect to the solution of certain problems, conceptual 
knowledge can be relevant when such solutions require retrieval of 
interconnected concepts. For example, to test the assumptions of a 
one-factorial analysis of variance, students learning statistics need
to know that the dependent variable has to be measured on a metric 
level and that the values of all learners have to be normally 
distributed in all conditions. Obviously, the concepts of the depen- 
dent variable’s measurement level and distribution are intercon- 
nected. 

Worked examples may particularly support the acquisition of 
conceptual knowledge when the solution steps include conceptual 
knowledge elements that go beyond the information that is accessible
in a problem-solving condition. In contrast, when much of the
conceptual information needed to solve a problem is provided
in both the problem-solving condition and the worked examples,
the worked examples contain only limited conceptual
information beyond the problem-solving condition. Worked
examples and the comparison condition may then largely overlap
in instructional support, making a benefit of worked
examples unlikely (see also Schwonke et al., 2009). To summarize,
there is evidence that worked examples particularly foster the
acquisition of application-oriented knowledge. In addition, worked 
examples can foster the acquisition of conceptual knowledge under 
certain conditions. 


Potential Moderators of the Effectiveness 
of Worked Examples 


Prior Knowledge 


In the worked examples studies, prior knowledge has often been 
included as a potential moderator of the effectiveness of worked 
examples learning. Worked examples are more effective for 
knowledge acquisition when prior knowledge is low. In contrast, 
unguided problem solving is more effective for learners with more 
extensive prior knowledge (e.g., Kalyuga & Sweller, 2004). According
to the literature, worked examples can even be detrimental
for more advanced learners, a phenomenon called the expertise
reversal effect (for an overview, see Kalyuga, 2007).


Executive Functions 


Other prominent basic cognitive functions and fluid intelligence 
have rarely been measured in these studies. Recent discussions 
have raised the question of how basic cognitive function constructs 
from cognitive and developmental psychology relate to the con- 
structs of intelligence (e.g., Friedman et al., 2006). Basic cognitive 
functions are cognitive functions including “executive functions, 
long-term memory, as well as processing speed” (Andersson, 
2010, p. 117). Executive functions have received special attention
in recent years because they are essential for self-control or
self-regulation (Miyake & Friedman, 2012). Studies have shown that
self-control ability has broad consequences for daily life, including 
learning (e.g., Mischel et al., 2011). 

The three fundamental executive functions frequently consid- 
ered in the literature are (a) shifting, (b) updating and monitoring 
information in working memory, and (c) inhibition (Best, Miller, 
& Naglieri, 2011; Miyake et al., 2000). These functions “make 
possible mentally playing with ideas; taking the time to think 
before acting; meeting novel, unanticipated challenges; resisting 
temptations; and staying focused” (Diamond, 2013, p. 135). A 
growing body of research suggests that working memory and 
shifting as executive functions are especially relevant for a broad 
range of cognitive achievements, including learning (e.g., Yeniad, 
Malda, Mesman, van Ijzendoorn, & Pieper, 2013; Yuan, Steedle, 
Shavelson, Alonzo, & Oppezzo, 2006). Updating is strongly re- 
lated to performance in the operation span task that measures 
working memory capacity (Miyake et al., 2000). 

In contrast to shifting and working memory capacity, discus- 
sions of inhibition as an executive function or as a distinguishable 
construct have been controversial. On the basis of structural equa- 
tion modeling, several studies (for an overview, see Miyake & 
Friedman, 2012) have demonstrated that when a common factor is 
extracted from the factors representing inhibition, updating, and 
shifting, no unique variance is left for inhibition. Furthermore, a 
growing body of evidence suggests that inhibition is a multidi- 
mensional construct (e.g., Krumm et al., 2009; Nigg, 2000), in that 
several statistically separable inhibition-related abilities exist (e.g., 
Hedden & Yoon, 2006). Thus, we further elaborate only on the 
executive functions of working memory and shifting. 

Working memory. Working memory is involved when infor- 
mation must be stored and additionally manipulated (Baddeley, 
Allen, & Hitch, 2011). Working memory is considered to be an 
important factor for learning because it provides a mental work- 
space in which information can be held while the learner is 
mentally engaged in other relevant activities (Gathercole & Allo- 
way, 2007). The capacity of working memory is associated with 
important cognitive achievements, such as problem solving (e.g., 
Bühner, Kröner, & Ziegler, 2008) and reading comprehension
(Daneman & Merikle, 1996). 

The superiority of worked examples regarding knowledge ac- 
quisition has been associated theoretically with a reduced cogni- 
tive load in working memory in learning environments based on 
worked examples compared with those based on unguided prob- 
lem solving. Performance such as knowledge acquisition is often 
related to the effects of cognitive load. Research on multimedia 
learning as well as research on cognitive load theory assume that 
a reduced cognitive load is causally related to improved knowledge
acquisition (Brünken, Plass, & Leutner, 2003; de Jong, 2010).
According to Sweller (2011), worked examples reduce the number 
of interacting elements that have to be processed in working 
memory and, thus, cognitive load. 

Cognitive load theory suggests that the limits of working mem- 
ory are crucial for instructional design decisions (e.g., Sweller, 
2011). Learners for whom the requested working memory capacity 
in a learning environment is higher than their available working 
memory capacity usually experience overload. The prerequisite is 
that learners have to work on tasks under time pressure and cannot 
off-load working memory (e.g., by writing down intermediate 
results; see de Jong, 2010). 
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Van Gog and Rummel (2010, p. 160) argued that learners with 
a lower working memory capacity might benefit especially from 
worked examples. Building on their argumentation, worked exam- 
ples could be assumed to be Jess effective for learners with a high 
working memory capacity, compared with those with a low work- 
ing memory capacity, because the cognitive load in a condition 
without worked examples (e.g., unguided problem solving) would 
be less likely to induce the overload in working memory. Hence, 
a moderating role of working memory capacity for knowledge 
acquisition seems plausible. Interestingly, only a limited number 
of studies have so far tested this possibility directly. In particular, 
de Jong (2010) stated that measures of working memory capacity 
are generally not included in cognitive load research; he listed a 
few studies that are exceptions (Berends & van Lieshout, 2009; 
Lusk et al., 2009; Seufert, Schütze, & Brünken, 2009; Van Gerven,
Paas, Van Merriënboer, & Schmidt, 2002). De Jong also listed one
study by Van Gerven, Paas, Van Merriënboer, and Schmidt (2004)
as an exception, but that study did not include any measure of 
working memory capacity. 

In the remaining studies cited by de Jong (2010), working 
memory capacity was included as a control variable (Berends & 
van Lieshout, 2009; Van Gerven et al., 2002), or as potential 
moderator variable of the effects of segmentation (Lusk et al., 
2009) and presentation format (audiovisual vs. visual-only presen- 
tation of learning material; Seufert et al., 2009). However, no 
studies tested whether the effectiveness of the presence of worked 
examples depends on working memory capacity as measured with 
reliable and valid tasks. In addition, working memory capacity was 
measured with only one task in these studies. Thus, there is 
nonspecific variance related to the specific task context. This 
problem can be handled by using several tasks to measure working 
memory capacity or an executive function, respectively (see also 
Miyake & Friedman, 2012). Furthermore, the specific role of 
working memory capacity for learning outcomes, controlling for 
other important cognitive functions such as fluid intelligence and 
shifting, has not yet been investigated in research about the effec- 
tiveness of the presence of worked examples. 

Shifting. Shifting is a basic cognitive function or core exec- 
utive function. It is a potential moderator of the effectiveness of 
worked examples but has not been particularly considered in 
research on example-based learning. Shifting is the ability to 
flexibly switch between different tasks, different strategies, or 
different aspects of prior knowledge (e.g., Friedman et al., 2006). 
It is also termed task switching (Miyake et al., 2000). Shifting is 
necessary when learners have to switch between certain aspects of 
a task (e.g., to use different pieces of information to solve a 
problem). Shifting is important, even for young children, for the 
successful completion of intellectual tasks (for an overview, see 
Best, Miller, & Jones, 2009). 

Two meta-analyses by Yeniad et al. (2013) demonstrated the 
relevance of shifting for academic achievement in children aged 4 
to 7 years. In these analyses, shifting had a significant influence on 
performance in math and reading. In addition, shifting was signif- 
icantly associated with general intelligence. Yeniad et al. con- 
cluded that there are domain-general links between shifting ability, 
general intelligence, and academic skills. In line with the results by 
Yeniad et al., a meta-analysis by Friso-van den Bos, van der Ven,
Kroesbergen, and van Luit (2013) found a relation between shift- 
ing and mathematics achievement in children aged 4 to 12 years. 


There are also concerns about the association of shifting and 
mathematics achievement, however. For example, Yeniad et al. 
(2013) did not control for executive functions other than shifting or 
processing speed in their analyses (see Bull & Lee, 2014). Hence, 
it cannot be ruled out that other executive functions or processing 
speed may primarily be responsible for the association between 
shifting and mathematics achievement. Furthermore, the meta-
analyses by Yeniad et al. and Friso-van den Bos et al. (2013)
included only studies with children. Executive functions can be 
separated to a varying extent with age (e.g., Lee, Bull, & Ho, 2013) 
and therefore might play a different role for learning in adulthood. 
Despite these limitations, research demonstrates that individual 
differences in executive functions show a certain degree of stabil- 
ity over long time intervals (Miyake & Friedman, 2012), especially 
from early and late adolescence to young adulthood (Friedman et 
al., 2015). Therefore, it seems plausible to assume that shifting 
could be a relevant cognitive function for learning outcomes even 
beyond childhood. 

In sum, shifting ability seems to be important for learning. 
Shifting has not yet been measured in studies contrasting worked 
examples and unguided problem solving. Solving a complex prob- 
lem may require switching between the problem and relevant 
information to solve it, as well as switching between certain 
aspects of the problem or different documents containing relevant 
information. Individuals with lower shifting ability might struggle 
in a learning environment based on unguided problem solving 
because of high shifting demands. In contrast, worked examples 
provide solution steps and the final solution, so the requirements to 
switch between information to solve the problem and the problem 
itself should be lower. Differences in shifting ability should there- 
fore play a smaller role in learning environments with worked 
examples. 

Fluid intelligence. Basic cognitive functions, particularly ex- 
ecutive functions, may influence higher order cognitive functions 
such as fluid intelligence, but they are more fundamental (Dia- 
mond, 2013). Fluid intelligence is “the ability to reason and to 
solve novel problems” (König, Bühner, & Mürling, 2005, p. 245),
when prior knowledge plays a minor role for solving the problems 
(Schneider & McGrew, 2012). 

The association between working memory capacity and fluid 
intelligence is especially well established (e.g., Buehner, Krumm, 
Ziegler, & Pluecken, 2006; Engle, Tuholski, Laughlin, & Conway, 
1999). A recent analysis by Redick, Unsworth, Kelly, and Engle 
(2012) found a correlation of .53 between working memory ca- 
pacity and fluid intelligence. Hence, the constructs share a com- 
mon variance but are not identical. Conceptually, working memory 
capacity is a construct based on theories about cognitive architec- 
ture. In contrast, intelligence is based on predominantly implicit 
theories of researchers and their assumptions about how to test 
intelligence (see Oberauer, Schulze, Wilhelm, & Süß, 2005). We
argue, however, that if the unique influence of either working 
memory capacity or fluid intelligence on learning outcomes should 
be investigated, both working memory capacity and fluid intelli- 
gence must be measured to control for the common variance of 
these variables. 

For several reasons, it seems worthwhile to consider fluid in- 
telligence as a potential moderator of the effects of the presence of 
worked examples on knowledge acquisition. Specifically, fluid 
intelligence seems to be important for learning because it moderates
knowledge acquisition, for example, with respect to lexical
knowledge (e.g., Ziegler, Danay, Heene, Asendorpf, & Bühner,
2012). Notably, the influence of fluid intelligence on learning is
most evident for novel and complex tasks (Primi, Ferrão, &
Almeida, 2010). In line with this assumption, a body of research 
indicates that fluid intelligence is predictive for solving complex 
problems—for example, in the domain of math (for an overview, 
see Primi et al., 2010). Learners typically have to decide which 
information is necessary to solve a complex problem. Accordingly, 
the demand on fluid intelligence should be predominantly high in 
a learning environment without worked examples, in which learn- 
ers with relatively low fluid intelligence might be mentally more 
strained. In contrast, learning environments with worked examples 
should put a lower demand on learners’ reasoning abilities because 
the solution steps and the final solution for a problem are provided. 
Thus, fluid intelligence might moderate the effectiveness of the 
presence of worked examples. 

In conclusion, convincing evidence shows that the basic func- 
tions of working memory capacity and shifting, as well as the more 
complex cognitive function of fluid intelligence, are important for 
cognitive achievements in various domains. Direct studies of the 
moderating influences of these cognitive functions in learning 
environments with and without worked examples are scarce, how- 
ever. 


Research Question and Hypotheses 


The relevance of executive functions and fluid intelligence 
for many cognitive tasks has been shown repeatedly. In partic- 
ular, working memory has been found to be important for 
learning (Gathercole & Alloway, 2007). Executive functions 
and fluid intelligence potentially moderate the effectiveness of 
worked examples concerning knowledge acquisition. However, 
this moderating role has not been investigated with state-of- 
the-art measurement of executive functions and fluid intelli- 
gence. The research question for this study addresses this 
research gap: 


Research Question: To what extent do the executive functions of 
shifting and working memory capacity as well as fluid intelligence 
moderate the effect of the presence of worked examples on knowledge 
acquisition? 


Based on research about the effectiveness of worked exam- 
ples, we expected cognitive load to be lower and knowledge 
acquisition to be higher in a learning environment with worked 
examples compared with a learning environment without 
worked examples (problem solving). Based on the expertise 
reversal effect (Kalyuga, 2007), we expected that the higher the 
prior knowledge is, the lower the supporting effect of worked 
examples is compared with unguided problem solving. We 
assumed that working with statistical problems can foster the
acquisition of different kinds of knowledge. We differentiated 
between application-oriented and conceptual knowledge as 
learning outcomes. 

Specifically, our hypotheses are as follows: 


1a. The acquisition of application-oriented knowledge is
higher in the presence of worked examples than in the 
absence of worked examples (problem solving). 


1b. The acquisition of conceptual knowledge is higher in the
presence of worked examples than in the absence of 
worked examples (problem solving). 


2. Cognitive load is higher in the absence of worked exam- 
ples (problem solving) than in the presence of worked 
examples. 


3. The benefits of worked examples will be greater for stu- 
dents with low prior knowledge than for students with high 
prior knowledge. 


4a. The benefits of worked examples will be greater for stu- 
dents with low working memory capacity than for students 
with high working memory capacity. 


4b. The benefits of worked examples will be greater for stu- 
dents with low shifting ability than for students with high 
shifting ability. 


4c. The benefits of worked examples will be greater for stu- 
dents with low fluid intelligence than for students with 
high fluid intelligence. 
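
Hypotheses 3 and 4a to 4c are moderation hypotheses. One common way to write such a hypothesis down is a regression model with an interaction term. The model below is offered only as an illustrative sketch, not as the analysis reported in this article, and the symbols Y, X, M, and the coefficients are notation introduced for this illustration.

```latex
% Y_i : knowledge acquisition of learner i (posttest minus pretest score)
% X_i : condition (1 = worked examples, 0 = problem solving)
% M_i : mean-centered moderator (e.g., working memory capacity)
\begin{align}
  Y_i &= \beta_0 + \beta_1 X_i + \beta_2 M_i + \beta_3 (X_i \cdot M_i) + \varepsilon_i \\
  \text{benefit of worked examples at } M_i = m &:\quad \beta_1 + \beta_3 m
\end{align}
```

Under this notation, Hypotheses 1a and 1b correspond to a positive coefficient for condition, and each moderation hypothesis corresponds to a negative interaction coefficient, so that the benefit of worked examples shrinks as the moderator increases.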


Method 


Participants 


The sample size was determined based on a power calculation 
with G*Power, assuming medium effects of the presence of worked
examples on knowledge acquisition and medium effects of the 
moderator variables. Participants were 76 German-speaking un- 
dergraduate students from programs in educational science, psy- 
chology, and school psychology at the University of Munich. The 
selection criterion was that participants had a low level of prior 
knowledge in statistics, which is taught in all three programs 
during the first four semesters. Values that were in the lower third 
of possible test values for prior application-oriented and conceptual
knowledge were considered as low. The mean age of the
participants was 23.83 years (SD = 5.70). Sixty-seven participants 
were female and nine were male, which represents the gender 
distribution in the respective programs. 


Design 


The independent variable presence of worked examples was 
implemented with two conditions: (a) worked examples, and (b) 
the absence of worked examples (problem solving). Students were 
randomly assigned to one of these two experimental conditions of 
a one-factorial design. 

Participants received either €40 ($43.56) or a certificate of 
participation in an empirical study, which is compulsory for suc- 
cessfully completing the psychology degree programs at the Uni- 
versity of Munich. 


Procedure 


The study took place in a laboratory room at the University of 
Munich. The experiment took about 4 hr and comprised a pretest, 
an intervention phase, and a posttest. The pretest assessed demographic
variables, prior application-oriented and conceptual
knowledge, working memory capacity, shifting, and fluid intelli- 
gence. After an introduction, participants worked on the prior 
knowledge tests, the first page of which asked for demographics 
and had a code to ensure the anonymity of the respondents. 
Participants then worked on three tasks to measure working mem- 
ory capacity, three tasks to measure shifting, and the fluid intelli- 
gence test battery. The pretest took about 160 min. 

The intervention phase and the posttest took place in a second 
session. In the intervention phase, participants read the first sta- 
tistical problem and then indicated their interest and motivation to 
work on problems similar to the one presented. Subsequently, the 
participants worked on three statistical problems in one of the two 
differently structured experimental conditions. After each statisti- 
cal problem, participants indicated their cognitive load on the 
respective scale. We chose the subject of statistics because the 
superiority of worked examples over unguided problem solving 
has been repeatedly shown in such a well-structured domain (e.g., 
Leppink et al., 2014; Paas, 1992). Furthermore, statistics is rele- 
vant in a variety of disciplines, curricula, and situations in daily 
life (Leppink, Paas, Van der Vleuten, Van Gog, & Van Merriënboer,
2013). Statistics is compulsory in the study programs in
educational science, psychology, and school psychology at the 
University of Munich. 

Participants worked individually on the statistical problems in 
the computer-based learning environment. In both experimental 
conditions, with and without worked examples, participants 
worked on three statistical problems from the subject area of the 
general linear model. The participants saw the statistical problems 
and the information to solve them on laptops via PowerPoint 
slides. Participants could scroll through these slides for informa- 
tion to solve the statistical problems in both experimental condi- 
tions. 

Additionally, only participants in the experimental condition 
with worked examples were able to access slides containing solution
steps for the statistical problems. These participants were able
to scroll back and forth through all slides at any time during the 
experiment. In the problem-solving condition, participants were 
able to access information after reading a statistical problem. 
Figure 1 shows the first statistical problem, including the task 
instruction. Figure 2 displays the first part of the corresponding 
information to solve the problem. 

The information was organized similarly to text in a book 
chapter. The statistical problems and the corresponding informa- 
tion were designed to be rather complex. To reach a high degree of 
complexity, the statistical problems and the corresponding infor- 
mation also contained aspects that were irrelevant for the current 
task. 

In both conditions (worked examples and problem solving), 
participants worked on the same statistical problems and corre- 
sponding information. Participants in the worked example condi- 
tion accessed a worked example for each problem before receiving 
the corresponding information. The worked examples following 
each problem consisted of three solution steps, with a highly 
similar scheme for all statistical problems. The first solution step 
showed the study design and/or the independent and dependent 
variables. The second solution step described the appropriate sta- 
tistical method or test for the particular problem. The third solution 
step displayed the assumptions to apply a certain statistical method 
(Problems 1 and 3) and considerations about the causality of 
results (Problem 2). Because of the solution steps provided to 
them, the participants in the worked-example condition did not 
have to search for the relevant information to solve the problems. 

Participants in both conditions were asked to generate self- 
explanations (e.g., Chi, Bassok, Lewis, Reimann, & Glaser, 1989), 
because the advantage of worked examples compared with prob- 
lem solving could diminish if self-explanations were required only 
in the problem-solving condition (Schwonke et al., 2009). Self- 
explanations had to be given on the answer sheet in the form of 
justifications for answers to the statistical problems. Without
self-explanation prompts, students in the condition with worked
examples could have just read the worked examples without using the
additional information to justify the solution steps.


Problem 1


Max Musterhirn, a cognition scientist, wants to find out whether the
presence of a person has an influence on the effectiveness of a cognitive
training. He has recruited students of pedagogy and psychology (N = 120),
who were randomly assigned to two groups. One group completed a
cognitive training for 4 weeks alone in a laboratory; the other group
completed the same cognitive training, also for 4 weeks, but under
supervision of an experimenter. Max measured training performance
(interval-scaled trait) at the beginning, after two weeks, and at the end of
the training. Because Max has not yet run a study with the cognitive training
with a student sample, he is interested in whether the training performance
of the participants differs between the three times of measurement. In
addition, he assumes that the effect of the presence of an experimenter is
not identifiable immediately, but depends on the duration of the training. In
order to guarantee a correct statistical analysis, Max Musterhirn also wants
to consider the assumptions for his analysis. How can Max Musterhirn's
questions be answered statistically? Please justify your answers as far as possible.


Figure 1. Statistical problem including the task instruction (translated into English).


Information for problem 1


There are various statistical methods to compare two means from more than two
samples. The number of samples to be compared is important as well as whether the
samples are independent from each other. A one-factorial analysis of variance without
repeated measurements is appropriate to compare several independent samples that
are realizations of one factor or one independent variable. In this context, there is a
one-factorial design because means of the stages of only one factor or independent
variable are considered. If a sample is investigated on several measurement times, the
values of the dependent variables on the different measurement times depend on each
other. In this case, there is also a one-factorial design with one factor for the repeated
measurement (independent variable). In contrast, if means from combinations of factor
stages are compared, there is a multifactorial design. For example, if there are three
independent variables or factors with three stages each, there is a 3 × 3 × 3-factorial
design. Thus, a three-factorial analysis of variance would be used for the analysis. A
special case is the combination of one factor for repeated measurement and one or
several other factors. The analysis can be done with multifactorial analysis of variance
with repeated measurement. If one of the factors in the example with three factors is a
factor for repeated measurement (with three stages), differences between the means of
the combinations of the factors can be found with a three-factorial analysis of variance
with repeated measurement.


Figure 2. Part of the information to solve the first statistical problem (translated into English).

The intervention phase lasted about 60 min. After participants 
had worked on all the statistical problems, they had to complete the 
knowledge tests at posttest, which lasted about 20 min in total. 


Measures 


The dependent variables, acquisition of conceptual knowledge 
and acquisition of application-oriented knowledge, were measured 
by pretests and posttests. In addition to measuring prior 
application-oriented and conceptual knowledge, we measured 
working memory capacity, shifting ability, and fluid intelligence 
as potential moderators at pretest. Cognitive load was measured 
during the intervention phase. 

Acquisition of application-oriented and conceptual knowledge 
were each measured by the difference between the score in a prior 
knowledge test and the score in a parallel version of this test 
administered at posttest. The knowledge tests were designed based 
on existing items for exercises or exams from the Chair for 
Psychological Methods at the University of Munich. These exer- 
cises and exams have proven to differentiate among students with 
varying levels of statistical knowledge, with items covering a 
relatively wide range of difficulty levels. The tests included tasks 
measuring knowledge about basic statistics (e.g., scale levels), and 
especially knowledge about the general linear model, which is 
typically taught in undergraduate statistics courses in psychology 
study programs. 

The test questions were intended to capture either application- 
oriented knowledge or conceptual statistical knowledge. Notably, 
the response format of items used to measure these two kinds of 


knowledge differs among prior studies investigating the effective- 
ness of worked examples. For example, in the studies by Paas 
(1992) and Leppink et al. (2014), application-oriented knowledge 
was assessed with open-ended questions. In contrast, Schwonke et 
al. (2009) used several multiple choice questions to assess con- 
ceptual knowledge. Multiple choice questions may be appropriate 
to measure conceptual knowledge because ruling out similar re- 
sponse alternatives may involve distinguishing different concepts 
or facts associated with certain concepts (i.e., conceptual under- 
standing; see also Zepeda, Richey, Ronevich, & Nokes-Malach, 
2015). Multiple choice questions plausibly are less appropriate to 
measure application-oriented knowledge, however. Such questions 
may typically require the recall of interconnected concepts. De- 
signing questions to capture knowledge necessary to identify the 
important aspects of a problem with relevant and irrelevant infor- 
mation is presumably difficult. In addition, creating multiple 
choice questions testing knowledge that must be applied to solve a 
problem (e.g., application of the Bayes’s theorem in the study by 
Leppink et al., 2014) seems to be complicated. 

Application-oriented knowledge tests. Application-oriented
knowledge to solve complex statistical problems was assessed 
with open questions. For example, in the first question of the 
pretest, participants had to determine the design of a study and 
explain how the research questions could be addressed statistically. 
Determining the design of a study involves applying knowledge on 
factors constituting research designs to identify the relevant vari- 
ables in the statistical problem. To address research questions 
statistically, knowledge on when to apply which statistical analysis 
in a given situation had to be applied to the statistical problem. 
Participants in the application-oriented knowledge tests had to 
identify the effects to be analyzed and choose the appropriate 
statistical method to calculate the effects. In addition, the participants
had to briefly justify their answers (e.g., by naming the
independent variables that constitute a design). In summary, to 
solve the complex problems in the open questions, participants had 
to use application-oriented knowledge to identify variables, iden- 
tify effects to be analyzed, and choose a suitable statistical method. 
To evaluate the open questions, each of the two questions was 
divided into three subquestions. One point was assigned for each 
correct answer on a subquestion, and no point was assigned for an 
incorrect answer. Thus, the maximum number of points for the 
open questions was six. Application-oriented knowledge at post- 
test was assessed with a parallel version of the test just described. 
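
The scoring rule just described, together with the definition of knowledge acquisition as the posttest score minus the pretest score, can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure; the function names and the boolean coding of subquestions are assumptions made for the example.

```python
from typing import Sequence


def score_open_test(subquestions_correct: Sequence[Sequence[bool]]) -> int:
    """Score the open-question test: one point per correct subquestion.

    Expects 2 questions with 3 subquestions each (maximum score 6),
    following the description in the text.
    """
    if len(subquestions_correct) != 2 or any(len(q) != 3 for q in subquestions_correct):
        raise ValueError("expected 2 questions with 3 subquestions each")
    return sum(1 for question in subquestions_correct for ok in question if ok)


def knowledge_acquisition(pretest_score: int, posttest_score: int) -> int:
    """Knowledge acquisition as the difference posttest minus pretest."""
    return posttest_score - pretest_score


# Hypothetical participant: 1 of 6 points at pretest, 4 of 6 at posttest.
pre = score_open_test([[True, False, False], [False, False, False]])
post = score_open_test([[True, True, False], [True, True, False]])
gain = knowledge_acquisition(pre, post)
```

The same difference score applies to the conceptual knowledge tests, with a maximum of 20 points instead of 6.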

The Kuder–Richardson-20 coefficient was calculated as a reliability
index. The values were rather low: r_tt = .49 (prior
application-oriented knowledge) and r_tt = .44 (application-oriented
knowledge at posttest). The first author of the study and
one student assistant coded more than 40% of the answers of the
application-oriented knowledge tests at pre- and posttest. Cohen’s
kappa was calculated separately for application-oriented knowledge
tests at pre- and posttest. The average values were κ = .95 for
questions assessing application-oriented knowledge at pretest, and
κ = .92 for questions assessing application-oriented knowledge at
posttest. Therefore, interrater agreement was good. When there 
was disagreement between raters, the respective responses were 
rechecked and incongruities were resolved. 
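
For readers unfamiliar with the Kuder–Richardson-20 coefficient, the following sketch shows the standard formula for dichotomously scored items. It is an illustration only: the function name and the toy data are hypothetical, and the use of the population variance of the total scores is an assumption, because the text does not state which variance estimator was used.

```python
from statistics import pvariance
from typing import Sequence


def kr20(item_scores: Sequence[Sequence[int]]) -> float:
    """Kuder-Richardson-20 for 0/1 item scores.

    item_scores holds one row per participant and one column per item.
    KR-20 = k / (k - 1) * (1 - sum(p_j * q_j) / variance of total scores).
    """
    n_persons = len(item_scores)
    n_items = len(item_scores[0])
    if n_items < 2 or n_persons < 2:
        raise ValueError("need at least 2 items and 2 participants")
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_persons  # proportion correct
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - sum_pq / total_variance)


# Toy data (hypothetical): 4 participants, 3 items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(toy)
```

With few items, as in the six-point test described above, the coefficient tends to be low even when the items are reasonable, which is consistent with the modest values reported.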

Conceptual knowledge tests. Conceptual knowledge was as- 
sessed with multiple choice questions. Each of the four questions 
had five answer options. Several answer options could be correct. 
For example, participants had to indicate whether variances have 
to be equal in every subpopulation as an assumption of a two-way 
analysis of variance. To answer this question, participants had to 
retrieve knowledge about whether this assumption was connected 
to a two-way analysis of variance (i.e., whether it was one assump- 
tion of a two-way analysis of variance). Thus, answering the 
multiple choice questions required access to facts, concepts, or 
principles. One point was assigned for each correct answer, and no 
points were assigned for incorrect answers. Therefore, the maxi- 
mum score for the four multiple choice questions was 20. Con- 
ceptual knowledge at posttest was assessed with a parallel version 
of the test to assess conceptual knowledge at pretest. 
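
One plausible reading of this scoring rule is that each of the five answer options per question is judged separately, which yields the stated maximum of 4 × 5 = 20 points. The sketch below implements that reading. It is an assumption about the scoring, not a description taken from the authors, and the answer key shown is hypothetical.

```python
from typing import Sequence, Set


def score_multiple_choice(
    responses: Sequence[Set[int]],
    answer_key: Sequence[Set[int]],
    n_options: int = 5,
) -> int:
    """Award one point per answer option that is handled correctly.

    An option counts as correct when it is ticked and in the key,
    or left unticked and not in the key (assumed scoring rule).
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer key must have the same length")
    score = 0
    for ticked, correct in zip(responses, answer_key):
        for option in range(n_options):
            if (option in ticked) == (option in correct):
                score += 1
    return score


# Hypothetical key and one participant's ticks for the four questions.
key = [{0, 2}, {1}, {3, 4}, {0}]
ticks = [{0, 2}, {1, 2}, {3}, {0}]
points = score_multiple_choice(ticks, key)
```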

The Kuder–Richardson-20 coefficient was calculated as a reliability
index. The values were rather low: r_tt = .47 (prior conceptual
knowledge) and r_tt = .47 (conceptual knowledge at posttest).
The first author of the study and one student assistant coded more
than 40% of the answers of the conceptual knowledge tests at pre-
and posttest. Cohen’s kappa was calculated separately for
conceptual knowledge tests at pre- and posttest. The
average values were κ = .96 for questions assessing conceptual
knowledge at pretest and κ = .97 for questions assessing conceptual
knowledge at posttest. Thus, interrater agreement was good.
Again, the respective responses were rechecked and incongruities 
were resolved when raters disagreed. Disagreements resulted when 
coded answers differed from the sample solution by mistake. 
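
The interrater agreement reported for both knowledge tests is Cohen's kappa, which corrects the observed agreement between two raters for the agreement expected by chance. A minimal sketch of the standard formula follows; the function name and the example codes are hypothetical and are not taken from the study data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters coding the same answers.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of answers")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codes (1 = correct, 0 = incorrect) from two raters.
codes_a = [1, 1, 1, 0, 0, 0, 1, 0]
codes_b = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(codes_a, codes_b)
```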

Cognitive load. Cognitive load was assessed with a rating 
scale by Paas (1992), which was translated into German. Its anchor 
points ranged from very, very low mental effort (1) to very, very 
high mental effort (9). The participants had to rate their perceived 
mental effort associated with the work on each statistical problem. 
Unidimensional rating scales have been shown to be sensitive 


regarding differences in cognitive load, and there is evidence for 
the claim that they are reliable, valid, and not intrusive (e.g., 
Gimino, 2002). 

Working memory capacity. Working memory capacity was 
assessed with three different complex span tasks: the automated 
operation span, the automated reading span, and the automated 
symmetry span tasks (Redick, Broadway, et al., 2012). These fully 
computerized, mouse-controlled tasks were used within the soft- 
ware E-Prime (version 2.08.22) to capture the verbal (automated 
reading span, automated operation span) and visuospatial (auto- 
mated symmetry span) modality of working memory. Kane, Con- 
way, Hambrick, and Engle (2007) have suggested that differences 
in performance in complex span tasks reflect differences in 
domain-general executive attention. The three complex span tasks 
have two phases and differ only with respect to the stimulus 
material. In the first phase, information has to be processed and an 
element (a letter in the automated operation span task and the 
automated reading span task, or a black square in the automated 
symmetry span task) has to be memorized afterward. For example, 
in the automated operation span task, participants first have to
decide whether a mathematical equation is right or wrong, and
afterward they have to memorize a letter. After a variable series of
processing-storage sequences, the memorized elements have to be 
recalled. The dependent variable for all complex span tasks was 
the proportion of correctly recalled elements (see Conway et al., 
2005, for different scoring methods). 
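
As an illustration of the dependent variable, the sketch below computes the proportion of elements recalled in their correct serial position, pooled over all trials of one task. This is only one of several possible scoring methods, and the text does not specify which variant was applied; the function name and the example trials are hypothetical.

```python
from typing import Sequence, Tuple


def proportion_recalled(trials: Sequence[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Proportion of presented elements recalled in the correct position.

    Each trial is a pair (presented, recalled). Elements are compared
    position by position; missing responses count as incorrect.
    """
    n_correct = 0
    n_presented = 0
    for presented, recalled in trials:
        n_presented += len(presented)
        n_correct += sum(1 for p, r in zip(presented, recalled) if p == r)
    if n_presented == 0:
        raise ValueError("no elements were presented")
    return n_correct / n_presented


# Hypothetical operation span trials: letters presented vs. letters recalled.
example_trials = [("FHJ", "FHJ"), ("KLNP", "KLPN"), ("QR", "Q")]
span_score = proportion_recalled(example_trials)
```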

We calculated the internal consistencies (Cronbach’s alpha) of 
the complex span tasks by using the method of Kane et al. (2004): 
α = .87 (automated operation span), α = .71 (automated symmetry
span), and α = .89 (automated reading span). For the statistical
analyses, we used the mean of the results of the three complex span 
tasks, which were highly correlated (see Table A1). 
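
The general formula for Cronbach's alpha and the averaging of the three task scores into one composite can be sketched as follows. The sketch does not reproduce the specific procedure of Kane et al. (2004) for splitting a span task into parts; the function names and the example values are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence


def cronbach_alpha(part_scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for one row per participant, one column per part.

    alpha = k / (k - 1) * (1 - sum of part variances / variance of totals)
    """
    n_parts = len(part_scores[0])
    if n_parts < 2 or len(part_scores) < 2:
        raise ValueError("need at least 2 parts and 2 participants")
    part_variances = [variance([row[j] for row in part_scores]) for j in range(n_parts)]
    total_variance = variance([sum(row) for row in part_scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_parts / (n_parts - 1) * (1 - sum(part_variances) / total_variance)


def working_memory_composite(task_scores: Sequence[float]) -> float:
    """Mean of the task scores of one participant (e.g., three span tasks)."""
    return mean(task_scores)


# Hypothetical part scores for three participants on two parts of one task.
alpha = cronbach_alpha([[1, 1], [2, 3], [3, 2]])
# Hypothetical proportions for operation, reading, and symmetry span.
composite = working_memory_composite([0.8, 0.6, 0.7])
```

Averaging across tasks with different stimulus material reduces the task-specific variance that a single span task would carry.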

Shifting. We assessed shifting with the computerized tasks 
color-shape, number-letter, and category switch (Friedman, Miyake,
Robinson, & Hewitt, 2011; Friedman et al., 2008; for an
overview, see Miyake & Friedman, 2012). We applied the methods 
described by Friedman et al. (2015) for these tasks. The partici- 
pants worked on the tasks via keyboards on MacBooks using the 
software PsyScope X B51. The keys D and L were labeled “left”
(Key D) and “right” (Key L). Participants had to press the left or 
the right key with their index fingers to work on the tasks. The 
space key was used for further handling of the tasks, such as to 
start the task. All shifting tasks consisted of no-switch and switch 
trials. In no-switch trials, participants had to apply the same task 
rule as in the preceding trial. For example, in the color-shape task,
participants had to repeatedly classify the color or the shape of a 
circle or a triangle. A cue indicated which classification (color or 
shape) had to be performed in no-switch as well as in switch trials. 
In switch trials, the task rule for one trial changed from that of the 
preceding trial. The dependent variable for all shifting tasks was
the switch cost, that is, the difference in mean reaction time
between correct switch trials and correct no-switch trials.
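
The dependent variable for the shifting tasks can be sketched as follows. The sketch covers only the difference in mean reaction times for correct trials and deliberately omits the trimming of reaction times described by Friedman et al. (2008); the trial format and the field names are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, Sequence, Union

Trial = Dict[str, Union[bool, float]]


def switch_cost(trials: Sequence[Trial]) -> float:
    """Mean reaction time of correct switch trials minus correct no-switch trials.

    Each trial is a dict with the (hypothetical) keys
    'is_switch', 'correct', and 'rt_ms'. Error trials are ignored.
    """
    switch_rts = [t["rt_ms"] for t in trials if t["correct"] and t["is_switch"]]
    repeat_rts = [t["rt_ms"] for t in trials if t["correct"] and not t["is_switch"]]
    if not switch_rts or not repeat_rts:
        raise ValueError("need at least one correct trial of each type")
    return mean(switch_rts) - mean(repeat_rts)


# Hypothetical trials from one participant in the color-shape task.
example = [
    {"is_switch": True, "correct": True, "rt_ms": 900},
    {"is_switch": True, "correct": True, "rt_ms": 1100},
    {"is_switch": True, "correct": False, "rt_ms": 2000},
    {"is_switch": False, "correct": True, "rt_ms": 700},
    {"is_switch": False, "correct": True, "rt_ms": 800},
    {"is_switch": False, "correct": True, "rt_ms": 600},
]
cost_ms = switch_cost(example)
```

A larger switch cost indicates lower shifting ability, because switching between task rules slows the participant down more.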

Data were trimmed as described by Friedman et al. (2008) to 
handle outliers and improve normality. The split-half reliabilities 
(Guttman) for the shifting tasks were r_tt = .91 (color-shape), r_tt =
.90 (number-letter), and r_tt = .86 (category switch). For the statistical
analyses, we used the mean of the three shifting tasks, which were 
highly correlated (see Table A2). 

Fluid intelligence. Fluid intelligence was measured with the
three subtests of the computerized intelligence structure battery 
(INSBAT; Arendasy et al., 2012): (a) numerical inductive reason- 
ing, (b) figural inductive reasoning, and (c) verbal deductive 
reasoning. In the numerical inductive reasoning subtest, partici- 
pants had to recognize the rule underlying a series of numbers and 
complete the series. The figural inductive reasoning subtest presented
a 3 × 3 matrix with one empty field. Participants were
expected to recognize a rule for the symbols in the remaining eight 
fields. Participants then had to choose the correct symbol out of 
several others to complete the matrix. In the deductive reasoning 
subtest, participants had 45 s to draw a conclusion, building on two 
given statements based on five possible answers. The number of 
tasks varied for each subtest, dependent on the performance of the 
participants (adaptive testing). The results of the three subtests 
were transformed into a raw score, which was used as an estimate 
of fluid intelligence. The reliability of each subtest could be preset 
to α = .70 (Arendasy et al., 2012). 

Control variables. Control variables were demographic data 
(e.g., age, sex), number of semesters, interest, and motivation. Age 
was included to control for differences in executive functions (for 
an overview, see Diamond, 2013) and fluid intelligence (e.g., Li et 
al., 2004) across the life span. To account for possible effects of 
gender on outcome variables, gender was included as a control 
variable. Notably, most of the participants were female students, 
making it difficult to detect possible gender effects. Number of 
semesters was included to control for potential differences in 
knowledge among students at different stages of their studies that 
the prior knowledge tests might not have captured. Interest and 
motivation were measured with the Questionnaire of Current 
Motivation (QCM; Rheinberg, Vollmeyer, & 
Burns, 2001). Interest and motivation were included as control 
variables because these constructs have been shown to be relevant 
for learning outcomes (Rheinberg et al., 2001). 

The 7-point items of the QCM assessed the scales of interest, 
anxiety, probability of success, and challenge. Example items 
were “Having read the instruction, the task seems to be interesting” 
(interest scale); “I am a little bit afraid that I can disgrace myself” 
(anxiety scale); “I think everyone can manage that” (probability of 
success scale); and “The task is a real challenge for me” (challenge 
scale). Cronbach’s alpha was α = .80 for 
the interest scale, α = .82 for the anxiety scale, α = .89 for the 
probability of success scale, and α = .52 for the challenge scale. 
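
For readers less familiar with the internal consistency coefficient reported above, the following sketch shows the standard formula for Cronbach's alpha. The function name and the example scores are hypothetical and are not taken from the QCM data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    `items` holds one list of scores per item; position j in every list
    belongs to the same respondent. Sample variances (n - 1) are used.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sum score
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Made-up scores of four respondents on a three-item scale.
example_items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]
alpha = cronbach_alpha(example_items)
```

Alpha increases with the number of items and with the average covariance among them, which is one reason why short, heterogeneous scales tend to show low values.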


Statistical Analyses 


We applied the following conventions by Cohen (1988) to 
classify the correlations: r = .10 (small effect), r = .30 (moderate 
effect), and r = .50 (large effect). 

We used the following lower boundaries for eta squared (Murphy 
& Myors, 2004): η² = .01 (small effect), η² = .06 (medium 
effect), and η² = .14 (large effect). To improve the readability of 
this article, we do not refer to the classification of the magnitude 
of each effect in the Results section. 
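
The conventions above can be written as a small lookup, sketched here for illustration. The function name is hypothetical, and treating Cohen's values for r as lower boundaries (as stated explicitly only for eta squared) is an assumption.

```python
# Lower boundaries as (label, threshold), in ascending order.
R_BOUNDS = (("small", 0.10), ("moderate", 0.30), ("large", 0.50))
ETA_SQUARED_BOUNDS = (("small", 0.01), ("medium", 0.06), ("large", 0.14))


def classify_effect(value, bounds):
    """Return the label of the highest boundary that |value| reaches."""
    magnitude = abs(value)
    label = "below small"
    for name, lower in bounds:
        if magnitude >= lower:
            label = name
    return label


print(classify_effect(-0.30, R_BOUNDS))           # moderate
print(classify_effect(0.07, ETA_SQUARED_BOUNDS))  # medium
```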

For all statistical analyses, we set the alpha level at 5%. For the 
unstandardized regression coefficients in the moderator analyses, 
we reported 95% confidence intervals. 


Results 


Preliminary Analyses 


Meaningful correlations of dependent variables. Point-biserial 
and Pearson correlations were used to check for associations 
between control and dependent variables. Working memory capacity 
was significantly correlated with fluid intelligence, r(76) = 
.40, p < .001. Therefore, fluid intelligence was considered as a 
covariate in the moderation analyses for working memory capac- 
ity, and working memory capacity was considered as a covariate in 
the moderation analyses for fluid intelligence. Notably, working 
memory capacity was not significantly correlated with cognitive 
load, r(76) = −.10, p = .38. 

In addition to the significant correlation of fluid intelligence 
with working memory capacity, fluid intelligence was significantly 
correlated with cognitive load, r(76) = −.30, p = .001. The 
correlation between fluid intelligence and cognitive load remained 
significant, controlling for the influence of working memory 
capacity on both variables, r(76) = −.28, p = .01. 
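
The partial correlation can be approximated from the three zero-order correlations reported above using the standard first-order formula. This is only a sketch: the inputs are the rounded published values, so the result can differ slightly from the coefficient computed on the raw data.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z partialled out of both variables."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# x = fluid intelligence, y = cognitive load, z = working memory capacity
r_partial = partial_correlation(r_xy=-0.30, r_xz=0.40, r_yz=-0.10)
print(round(r_partial, 3))
```

With these rounded inputs the result is about −.285, close to the reported value of −.28.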

Cognitive load was also associated with prior conceptual knowl- 
edge, prior application-oriented knowledge, and some QCM scales 
(see Table A3). Thus, these variables were included in the analysis 
of covariance to test Hypothesis 2. 

Differences at pretest and correlations with knowledge 
acquisition. We used t tests for independent samples to test for 
differences in dependent and control variables between the two 
experimental conditions at pretest. 

Only the number of semesters differed significantly between the 
experimental conditions, t(74) = −2.49, p = .02. The number of 
semesters was higher in the learning environment with worked 
examples (M = 3.61, SD = 2.13) than in the learning environment 
without worked examples (M = 2.63, SD = 1.15). 

Prior application-oriented knowledge and application-oriented 
knowledge at posttest were correlated, r(76) = .37, p = .001. Prior 
conceptual knowledge and conceptual knowledge at posttest were 
correlated as well, r(76) = .24, p = .03. In addition, age and 
number of semesters were correlated with the acquisition of 
application-oriented knowledge, r(76) = .23, p = .047, for age, 
and r(76) = .25, p = .03, for number of semesters. We conse- 
quently controlled for age and number of semesters in the analysis 
of covariance used to test Hypothesis 1 and moderation Hypoth- 
eses 3 and 4. 


Effects of the Presence of Worked Examples on 
Knowledge Acquisition and Cognitive Load 
(Hypotheses 1 and 2) 


A two-factorial analysis of covariance with the factors Presence 
of Worked Examples (worked examples vs. problem solving) and 
Time of Measurement (pretest vs. posttest) was used to test 
whether acquisition of application-oriented knowledge depends on 
the presence of worked examples. There was no significant main 
effect for the presence of worked examples, F(1, 72) = 1.38, p = 
.24, ηp² = .02, and no significant main effect for the time of 
measurement, F(1, 72) = 0.20, p = .65, ηp² = .003. In line with 
Hypothesis 1a, acquisition of application-oriented knowledge 
depended on the presence of worked examples, as indicated by the 
significant interaction between the presence of worked examples 
and time of measurement, F(1, 72) = 5.75, p = .02, ηp² = .07. 
Acquisition of application-oriented knowledge was higher for the 
learning environment with worked examples (M_diff = 1.53) than 
for the learning environment without worked examples (M_diff = 
0.53). 

A two-factorial, univariate analysis of variance with the factors 
Presence of Worked Examples (worked examples vs. problem 
solving) and Time of Measurement (pretest vs. posttest) was used 
to test whether acquisition of conceptual knowledge depends on 
the presence of worked examples. There was no significant main 
effect for the presence of worked examples, F(1, 74) = 0.17, p = 
.68, ηp² = .002, but there was a significant main effect for the time 
of measurement, F(1, 74) = 33.16, p < .001, ηp² = .31. In contrast 
to Hypothesis 1b, the results do not support the claim that 
conceptual knowledge acquisition depended on the presence of 
worked examples, F(1, 74) = 0.56, p = .46, ηp² = .01. Table 1 
shows the means and standard deviations of application-oriented 
and conceptual knowledge at pre- and posttest separately for the 
two experimental conditions. Prior application-oriented knowl- 
edge was rather low (in the lower third of possible values) and 
prior conceptual knowledge was relatively high (above the mean 
of possible values). 

A one-factorial, univariate analysis of covariance was used to 
test whether cognitive load depends on the presence of worked 
examples. Descriptively, the cognitive load in the learning 
environment without worked examples (M = 6.05, SD = 1.11) was 
slightly higher than the cognitive load in the learning environment 
with worked examples (M = 5.85, SD = 1.48). However, in 
contrast to Hypothesis 2, this difference was not significant, F(1, 
68) = 0.93, p = .34, ηp² = .01. 
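
The partial eta squared values in this section can be recovered from the reported F statistics and their degrees of freedom with the standard conversion, sketched below. The function name is hypothetical; the inputs are the F values reported above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# Interaction for application-oriented knowledge, F(1, 72) = 5.75
print(round(partial_eta_squared(5.75, 1, 72), 2))
# Time of measurement for conceptual knowledge, F(1, 74) = 33.16
print(round(partial_eta_squared(33.16, 1, 74), 2))
# Cognitive load, F(1, 68) = 0.93
print(round(partial_eta_squared(0.93, 1, 68), 2))
```

The conversion is useful as a consistency check when only F and the degrees of freedom are legible in a report.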


Moderating Role of Prior Knowledge (Hypothesis 3) 


PROCESS (Hayes, 2013) was used for moderation analyses. Where 
necessary, we controlled for the influence of covariates on 
the independent variable, the moderator, and the dependent 
variable. Heteroscedasticity-consistent standard errors were 
estimated as suggested by Hayes (2012). We employed the Johnson-
Neyman technique to quantify the effect of the independent vari- 
able (presence of worked examples) on knowledge acquisition for 
different values of the respective moderator (prior knowledge, 
working memory capacity, shifting, or fluid intelligence). The 
Johnson-Neyman technique allows for the determination of 
whether the independent variable has a significant influence on the 
dependent variable for certain ranges of values of the moderator 
variable. 
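
The logic of the Johnson-Neyman technique can be sketched for a simple moderation model Y = b0 + b1·X + b2·M + b3·X·M. The conditional effect of X at moderator value M is b1 + b3·M, and the boundaries of the significance region are the values of M at which its t ratio equals the critical value. The sketch below is not the PROCESS implementation: the function names, the coefficient values, the (co)variances, and the critical t value are all hypothetical, and in practice the variances would come from the (heteroscedasticity-consistent) coefficient covariance matrix.

```python
from math import sqrt


def conditional_effect(m, b1, b3):
    """Effect of X on Y at moderator value m."""
    return b1 + b3 * m


def conditional_se(m, v11, v33, v13):
    """Standard error of the conditional effect at moderator value m."""
    return sqrt(v11 + 2 * m * v13 + m**2 * v33)


def johnson_neyman_bounds(b1, b3, v11, v33, v13, t_crit):
    """Moderator values at which |effect / SE| equals t_crit.

    v11 and v33 are the variances of b1 and b3, v13 is their covariance.
    Returns a sorted list of two boundaries, or [] if no real solution exists.
    """
    a = b3**2 - t_crit**2 * v33
    b = 2 * (b1 * b3 - t_crit**2 * v13)
    c = b1**2 - t_crit**2 * v11
    if a == 0:
        raise ValueError("degenerate case: boundary equation is not quadratic")
    discriminant = b**2 - 4 * a * c
    if discriminant < 0:
        return []
    root = sqrt(discriminant)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# Hypothetical estimates, chosen only to illustrate the computation.
bounds = johnson_neyman_bounds(b1=2.0, b3=-0.5, v11=0.25, v33=0.04,
                               v13=-0.08, t_crit=2.0)
for m in bounds:
    t = conditional_effect(m, 2.0, -0.5) / conditional_se(m, 0.25, 0.04, -0.08)
    print(round(m, 2), round(t, 2))
```

With these made-up inputs the boundaries lie at roughly 2.68 and 12.43 on the moderator scale. Whether the effect is significant between or outside the two boundaries has to be checked by evaluating the t ratio at a moderator value on either side.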


Table 1 

Means and Standard Deviations of Conceptual and Application-Oriented Knowledge at Pre- and Posttest Separately for the Two Experimental Conditions 

Type of knowledge                Experimental condition   M_pre (SD_pre)   M_post (SD_post) 
Application-oriented knowledge   Problem solving          1.68 (1.07)      2.21 (1.36) 
                                 Worked examples          1.47 (1.35)      3.00 (1.27) 
Conceptual knowledge             Problem solving          12.29 (2.42)     14.05 (2.34) 
                                 Worked examples          11.84 (2.94)     14.13 (2.23) 


We did not investigate the moderating role of prior conceptual 
knowledge on acquisition of conceptual knowledge because of a 
small variance in the lower range of prior conceptual knowledge. 
In addition, there was no effect of the presence of worked exam- 
ples on acquisition of conceptual knowledge. Accordingly, we 
investigated only the moderating role of prior application-oriented 
knowledge for the acquisition of application-oriented knowledge. 

In contrast to Hypothesis 3, prior application-oriented knowledge 
globally had no moderating role with respect to the acquisition 
of application-oriented knowledge, b = −0.03, 95% confidence 
interval (CI) [−0.53, 0.49], p = .94. In the lower range of 
theoretically possible values, the presence of worked examples had 
a significant influence on the acquisition of application-oriented 
knowledge. In this range, the higher the prior application-oriented 
knowledge was, the smaller the effect of the worked examples was 
compared with problem solving. This result pattern also applies to 
the remaining range of values, for which the effects were not 
significant (see Table A4). 


Moderating Role of Executive Functions and Fluid 
Intelligence (Hypothesis 4) 


The moderating influence of executive functions and fluid in- 
telligence was investigated only for the acquisition of application- 
oriented knowledge because there was no effect of the presence of 
worked examples on conceptual knowledge acquisition. 

Working memory capacity had no moderating influence glob- 
ally on the acquisition of application-oriented knowledge, 
b = −0.47, p = .80, 95% CI [−4.07, 3.14]. There was an area in 
the middle region of theoretically possible values of working 
memory capacity for which the presence of worked examples had 
a significant influence on acquisition of application-oriented 
knowledge. In this region, the lower the working memory capacity 
was, the higher the benefit of worked examples was compared with 
problem solving (see Table A5). 

Shifting moderated the influence of the presence of worked 
examples on acquisition of application-oriented knowledge, b = 
0.004, p = .004, 95% CI [0.001, 0.007]. There was a broad range 
of shifting values for which the presence of worked examples had 
a significant influence on knowledge acquisition (see Table A6). 
The worse the shifting ability was, the greater the effect of the 
worked examples compared with problem solving on acquisition 
of application-oriented knowledge was. 

Fluid intelligence also moderated the influence of the presence 
of worked examples on acquisition of application-oriented 
knowledge, b = −0.83, p = .03, 95% CI [−1.57, −0.08]. There was a 
broad range of fluid intelligence values for which the presence of 
worked examples had a significant influence on knowledge acqui- 
sition (see Table A7). The lower fluid intelligence was, the higher 
the effect of the presence of worked examples was on acquisition 
of application-oriented knowledge. 

Table 2 summarizes the novelty of the tested hypotheses (i.e., 
whether the hypothesis has been systematically addressed before) 
and whether they were supported in the data. 


Discussion 


The main finding of our study in the domain of statistics was 
that shifting and fluid intelligence, but not working memory 
capacity, moderated the influence of worked examples on the 
acquisition of application-oriented knowledge. The lower the shifting 
ability and fluid intelligence of the learners were, the higher the 
benefit of worked examples was compared with problem solving 
on the acquisition of application-oriented knowledge. We first 
discuss the effectiveness of worked examples for acquisition of 
application-oriented and conceptual knowledge, as well as for 
reducing cognitive load. Then we elaborate on the moderating role 
of prior knowledge, executive functions, and fluid intelligence. 


Table 2 

Summary of Results Related to Novelty (i.e., Whether a Hypothesis Has Been Systematically Addressed Before) and Support of Tested Hypotheses by the Data 

Hypothesis   Novelty of hypothesis?   Support of hypothesis? 
1a. Effect of the presence of worked examples on acquisition of conceptual knowledge = iE 
1b. Effect of the presence of worked examples on acquisition of application-oriented knowledge = a 
2. Effect of the presence of worked examples on cognitive load 7 * 
3. Moderating role of prior knowledge = ti 
4a. Moderating role of working memory capacity of i 
4b. Moderating role of shifting 7 uM 
4c. Moderating role of fluid intelligence ils 3 


Effects of Worked Examples on the Acquisition of 
Application-Oriented and Conceptual Knowledge 


The presence of worked examples fostered the acquisition of 
application-oriented knowledge, but had no effect on the acquisi- 
tion of conceptual knowledge. One reason for these results might 
be that learners already had a fairly high level of prior 
conceptual knowledge. Worked examples have been shown to be 
particularly effective for learners with low prior knowledge (Ka- 
lyuga, 2007). In contrast, prior application-oriented knowledge 
was rather low. Furthermore, a post hoc inspection showed that 
worked examples conveyed more information for acquisition of 
application-oriented knowledge compared with the problem-solving 
condition. The worked examples provided information about the 
relevant aspects of the statistical problems and conveyed the right 
knowledge to apply to these problems. In other words, the contents 
learned with worked examples might have been more pertinent to 
acquiring application-oriented knowledge than conceptual 
knowledge. As noted earlier, worked examples might not be beneficial if 
support for acquisition of the respective knowledge is similar in 
the comparison group (see also Schwonke et al., 2009). This seems 
to apply here because learners in both conditions received the 
additional information and learners in both conditions improved 
their conceptual knowledge. 


Effects of Worked Examples on Cognitive Load 


The small effect of the presence of worked examples on cognitive 
load was not significant, possibly because of low statistical 
power. A sample size estimation indicated that 1,618 participants 
would be necessary to establish the effect with a power of .80. 
Given that the means of the two experimental conditions differed 
only marginally, the effect seems not to be practically relevant. 

Although there was no effect of the worked examples on cognitive 
load, there was an effect on the acquisition of application-oriented 
knowledge. This is in line with several studies that 
showed that instructional support can be effective for knowledge 
acquisition even if it does not reduce cognitive load (e.g., de 
Koning, Tabbers, Rikers, & Paas, 2010; Lusk & Atkinson, 2007). 
The effect of the worked examples on acquisition of application- 
oriented knowledge and the nonsignificant effect of the worked 
examples on cognitive load suggest that the subjective rating of 
cognitive load may not be appropriate to explain differences in 
learning outcomes. 

Furthermore, cognitive load and working memory capacity were 
not correlated. The nonsignificant correlation casts doubt on the 
validity of the rating scale as a measure of cognitive load in 
working memory. In fact, “the validity of the test itself [the rating 
scale] is never questioned” (de Jong, 2010, p. 116). In addition, the 
subjective rating scale and other measures of cognitive load (e.g., 
physiological measures) do not measure cognitive overload. The 
threshold of overload is unknown (de Jong, 2010). Of course, one 
might also question whether the working memory capacity tasks 
used in the present study were of sufficient validity. However, 
these tasks have been shown multiple times to be reliable and valid 
indicators of working memory capacity (Redick, Broadway, et al., 
2012). In addition, concerning the nomological network of cogni- 
tive abilities, the correlation of working memory capacity and fluid 
intelligence was similar to that reported in the literature (e.g., 
Redick, Unsworth, et al., 2012). Thus, it does not seem likely that 
the nonsignificant correlation between working memory capacity 
and cognitive load can be attributed to a low validity of the 
working memory capacity tasks. 

Cognitive load is a multidimensional construct and possibly 
variables other than working memory capacity could influence the 
subjective rating of cognitive load. In the present study, fluid 
intelligence, prior application-oriented and conceptual knowledge, 
and some scales of the QCM were related to cognitive load. 
Opfermann (2008) argued that learning behavior could also have 
an influence on cognitive load, and that learning behavior is in turn 
influenced by the individual characteristics of the learner, such as 
metacognition and attitudes. In sum, the validity of the rating scale 
as a measure of cognitive load in working memory did not receive 
supporting evidence from the present study. 


Moderating Role of Prior Knowledge 


In contrast to our expectation, prior application-oriented knowl- 
edge did not moderate the effect of the worked examples on 
acquisition of application-oriented knowledge. This result can be 
explained by a low variance of prior application-oriented 
knowledge in the lower range of values. Only integer values between 0 
and 6 were possible for prior application-oriented knowledge. 
Many participants had the same low level of prior application- 
oriented knowledge, and only a few participants had medium to 
high levels of knowledge. A moderating role of prior application- 
oriented knowledge could not be established across the whole 
range of values because of too little variability. The resulting 
pattern does not contradict the assumption that worked examples 
are particularly advantageous for the acquisition of 
application-oriented knowledge by learners with low prior knowledge. The mean 
application-oriented knowledge was low in the sample, favoring a 
benefit of worked examples compared with problem solving (Ka- 
lyuga, 2007). Differences in the means (not variances) of this form 
of knowledge between the two learning environments have been 
found in the analysis of covariance. However, the lack of sufficient 
variability in prior application-oriented knowledge made it diffi- 
cult to confirm that the benefits of worked examples are lower for 
learners with more prior knowledge. 


Moderating Role of Executive Functions 
and Fluid Intelligence 


Working memory. The finding that working memory capac- 
ity had no moderating influence on the effect of worked examples 
on knowledge acquisition is in contrast to Hypothesis 4a. Expand- 
ing the argument made by van Gog and Rummel (2010, p. 169), 
we assumed that worked examples were less beneficial for learners 
with higher working memory capacity because cognitive load 
would be less likely to exceed their working memory capacity. 

Apart from considerations concerning the validity of the rating 
scale to measure cognitive load in working memory (see Effects of 
Worked Examples on Cognitive Load section), another reason for 
the absent effect of worked examples on cognitive load might be 
that learners had no time pressure to solve the statistical problems. 
Therefore, learners in the problem-solving condition might have 
reduced their working memory load by themselves. One possibility 
to reduce working memory load, obviously, is to put more demand 
on the shifting ability. This could be realized by switching more 
frequently between information given by a statistical problem and 
the information given to solve the problem, instead of holding a 
large amount of information given by the problem in working 
memory in order to solve the problem. Thus, the cognitive load in 
the condition without worked examples might have been similar to 
the cognitive load in the condition with worked examples. De Jong 
(2010) noted that in many studies of the cognitive load tradition, 
off-loading is disabled and the situation is artificially time-critical. 

Still another reason exists as to why worked examples might not 
have reduced cognitive load in working memory. Although par- 
ticipants with the worked examples were provided with the solu- 
tion steps, they had to maintain them in working memory when 
they searched for the correct information to give reasons for the 
solution steps. Following the explanation of Ayres and Sweller 
(2005), the affordance to mentally integrate the disparate informa- 
tion presented on separate slides might have put a demand on 
working memory capacity similar to the one in the problem- 
solving condition. In the problem-solving condition, participants 
had to store and process the information to solve the problems. 
Therefore, the present study is inconclusive concerning the 
question of whether working memory capacity has a moderating 
influence on knowledge acquisition when learning environments differ 
to a larger extent in their working memory capacity requirements. 
Notably, we assumed that domain-general working memory ca- 
pacity assessed with different tasks is not influenced by the prior 
knowledge in statistics. If this assumption holds true, the range of 
values for working memory capacity is not limited by a reduced 
range of prior knowledge levels. 

Shifting. Presumably, solving statistical problems requires 
switching among different kinds of information as well as switch- 
ing between information and the problem and switching among 
certain aspects of the problem. This is in accordance with research 
suggesting that shifting is relevant in complex tasks that require 
switching among certain aspects of a problem (Blair, Knipe, & 
Gamson, 2008; van der Sluis, de Jong, & van der Leij, 2007). 
Learners with a high shifting ability supposedly have less of a 
problem in a learning environment without worked examples 
(problem solving) because they have fewer problems with switch- 
ing between information to solve statistical problems and the 
problems themselves. Participants in the worked-example condi- 
tion saw the solution steps and were not required to consistently 
shift between aspects of a statistical problem and the information 
to solve it. Consequently, the necessity to switch between the 
problem and information to solve the problem could have been 
lowered. Hence, individual differences in shifting ability played a 
minor role in the learning environment with worked examples. The 
benefit of worked examples over the problem-solving condition 
seemed to be attenuated with better shifting ability. 

Fluid intelligence. Fluid intelligence is considered especially 
important for novel and complex cognitive tasks (Primi et al., 
2010). In addition, learners with high fluid intelligence might solve 
complex problems without worked examples because they can 
reason which information is relevant to solve a statistical problem 
more easily than can learners with low fluid intelligence. In line 
with this assumption and Hypothesis 4c, the benefit of worked 
examples over problem solving was reduced with higher fluid 
intelligence. Conversely, the benefit of worked examples over 
problem solving increased with lower fluid intelligence. Learners 
with lower fluid intelligence might struggle in a learning 
environment without worked examples because reasoning to iden- 
tify the relevant information for solving the problem at hand might 
be particularly difficult for them. Learners with lower fluid intel- 
ligence might have benefitted especially from the solution steps of 
the worked examples in the learning environment with worked 
examples because these reasoning processes were not required. 

In sum, shifting and fluid intelligence, but not working memory, 
moderated the effect of the presence of worked examples on 
acquisition of application-oriented knowledge. Presumably, the 
demands on shifting and fluid intelligence were especially high in 
the learning environment based on unguided problem solving, and 
the worked examples reduced these demands. The reduction in 
these demands was especially beneficial for learners with low 
shifting ability and low fluid intelligence. 


Limitations 


The reliabilities (internal consistencies) of the knowledge tests 
were rather low. Although we differentiated among different types 
of knowledge, the questions for the measurement of the respective 
types of knowledge differed markedly in terms of content because 
we intended to cover a broad knowledge range about statistical 
contents. The tasks of the knowledge tests were designed based on 
existing practice and exam questions from the Chair for Psycho- 
logical Methods at the University of Munich to achieve a high 
content validity. Although these tasks were carefully developed 
and tested in practice settings, they were heterogeneous with 
regard to the content. Low or high knowledge in a specific content 
area is not accompanied by low or high knowledge in another 
content area, which could explain the low reliabilities of the 
knowledge tests. In line with this argumentation, Wecker et al. 
(2013) noted that it is not atypical in knowledge tests that cover a 
broad content area for the single knowledge elements measured by 
single items to share no strong connections that allow learners to 
infer the knowledge elements from one item to another. 

Despite the low reliabilities of the knowledge tests, there was an 
effect of the worked examples on the acquisition of application- 
oriented knowledge. This effect supports the distinction between 
different types of knowledge because there was no effect of 
worked examples compared with problem solving on the acquisi- 
tion of conceptual knowledge. In addition, learners acquired con- 
ceptual knowledge independently of the presence of worked ex- 
amples. Therefore, the reliabilities of the knowledge tests were 
sufficient to establish an effect of the presence of worked examples 
on the acquisition of application-oriented knowledge and an effect 
of the time of measurement (pretest vs. posttest) on the acquisition 
of conceptual knowledge. The statistical power might have been 
higher if the reliabilities had been better. More similar items could 
have led to more reliable measurements of the different types of 
knowledge. If these items cover contents similar to the existing 
ones, validity should not be reduced. 

Furthermore, the content overlap between the questions mea- 
suring conceptual knowledge and the worked examples was lower 
compared with the questions measuring application-oriented 
knowledge. Additionally, prior conceptual knowledge of the stu- 
dents was relatively high. It remains unclear whether a significant 
benefit of the worked examples on the acquisition of conceptual 
knowledge could be found when the prior conceptual knowledge is 
low and the worked examples contain more information to foster 
the acquisition of conceptual knowledge. 

Another limitation concerns the problem cases from the domain 
of statistics. These problem cases were domain-specific analytical 
problems. It remains unresolved whether similar results 
can be found by using other types of problems (e.g., study design 
problems) or problems from other domains, which could put 
different demands on executive functions and fluid intelligence. If 
learning environments with and without worked examples differ 
largely in their demands on working memory capacity, working 
memory capacity should have a moderating role for knowledge 
acquisition. 

Furthermore, the power for the moderation effects of shifting 
and fluid intelligence was low, probably because the moderation 
effects were smaller than expected in the a priori computation of 
the sample size. Larger samples are needed to detect moderator 
effects of the size found in the present study. 

Finally, the sample of the present study consisted predominantly 
of female students. The generalizability of the findings to samples 
comprising more male students has to be established in future 
studies. 


Conclusions and Future Research 


Depending on the characteristics of the problems to be solved, 
various demands are put on different executive functions. Working 
memory might have a moderating role if learners experience 
overload in a learning environment without worked examples 
(problem solving) and if working memory load is significantly 
reduced by worked examples. In a typical, time-critical scenario, 
learners have to keep much information in working memory to 
solve a problem in a limited amount of time. Thus, overload in 
working memory may be induced. This overload could effectively 
be reduced by worked examples, and learners with a low working 
memory capacity might especially benefit from worked examples. 
Thus, working memory capacity might turn out to moderate the 
effectiveness of worked examples only for a specific type of 
problem to be solved. 

Shifting might be a candidate executive function for problems in 
which a frequent relocation of attention among different kinds of 
information is necessary. An example would be solving a complex 
problem in which the necessary information has to be retrieved 
from multiple sources. 

For integrating different pieces of information and deciding 
which is relevant for the solution of a problem, a general reasoning 
capacity, as measured by fluid intelligence, might be important 
beyond domain knowledge and executive functions. Worked ex- 
amples and problem solving could be contrasted using tasks that 
put either a high or low demand on shifting and/or fluid intelli- 
gence. Shifting and/or fluid intelligence may moderate the effec- 
tiveness of worked examples only when tasks put a high demand 
on these cognitive functions. 

The moderating role of executive functions can be examined 
selectively. In addition, worked examples and problem solving 
could be contrasted by using tasks that simultaneously put a high 
or low demand on working memory capacity (e.g., time-limited 
tasks), shifting, or fluid intelligence. The relative importance of 
these cognitive functions and their dependence on characteristics 
of tasks could thus be determined. 

In sum, the findings of the present study imply that other 
prominent cognitive functions such as shifting and fluid intelli- 
gence might be as important as prior knowledge (e.g., Kalyuga, 
2007) or working memory (e.g., Sweller, 2011) when worked 
examples are compared with problem solving. A learner’s self- 
reported cognitive load may be less valid as a predictor for learn- 
ing outcomes in environments with or without worked examples 
than more objective measures of executive functions. In addition, 
the zero correlation of cognitive load with working memory ca- 
pacity indicates that the construct validity of cognitive load may be 
doubtful, at least for this type of task, even when acknowledging 
method variance and level of symmetry differences as possible 
explanations. Although inconvenient for instructional designers, it 
might well turn out that cognitive load is a multidimensional rather 
than a unidimensional construct. 

This conclusion has to be validated, however, in further studies 
that also use more objective measures of cognitive load (e.g., 
secondary tasks; see Brünken et al., 2003). The practical conclusion 
is quite straightforward: Confronting learners in early phases 
of skill acquisition in the area of statistics with open problem 
solving will mainly help the teacher to distinguish between learn- 
ers with better executive functioning and higher fluid intelligence. 
Providing worked examples may contribute to giving students with 
different prerequisites similar chances to learn. 

Future studies should replicate the results of this study using 
different and larger samples to detect moderation effects with 
higher statistical power. These studies should employ different 
types of statistical problems and different types of tasks as well as 
other domains. 


References 


Andersson, U. (2010). Skill development in different components of arith- 
metic and basic cognitive functions: Findings from a 3-year longitudinal 
study of children with different types of learning difficulties. Journal of 
Educational Psychology, 102, 115-134. http://dx.doi.org/10.1037/ 
a0016838 

Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, 
M., Gittler, G., . . . Kértner, T. (2012). Intelligenz-Struktur-Batterie 
(INSBAT) [Intelligence Structure Battery]. Manual. Mödling, Austria: 
Schuhfried GmbH. 

Ayres, P., & Sweller, J. (2005). The split-attention principle in multimedia 
learning. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia 
learning (Vol. 2, pp. 135-146). New York, NY: Cambridge University 
Press. http://dx.doi.org/10.1017/CBO9780511816819.009 

Baddeley, A. D., Allen, R. J., & Hitch, G. J. (2011). Binding in visual 
working memory: The role of the episodic buffer. Neuropsychologia, 49, 
1393-1400. http://dx.doi.org/10.1016/j.neuropsychologia.2010.12.042 

Berends, I. E., & van Lieshout, E. C. D. M. (2009). The effect of 
illustrations in arithmetic problem-solving: Effects of increased cogni- 
tive load. Learning and Instruction, 19, 345-353. http://dx.doi.org/10 
.1016/j.learninstruc.2008.06.012 

Best, J. R., Miller, P. H., & Jones, L. L. (2009). Executive functions after 
age 5: Changes and correlates. Developmental Review, 29, 180-200. 
http://dx.doi.org/10.1016/j.dr.2009.05.002 

Best, J. R., Miller, P. H., & Naglieri, J. A. (2011). Relations between 
executive function and academic achievement from ages 5 to 17 in a 
large, representative national sample. Learning and Individual Differ- 
ences, 21, 327-336. http://dx.doi.org/10.1016/j.lindif.2011.01.007 

Blair, C., Knipe, H., & Gamson, D. (2008). Is there a role for executive 
functions in the development of mathematics ability? Mind, Brain, and 
Education, 2, 80-89. http://dx.doi.org/10.1111/j.1751-228X.2008 
.00036.x 

Brünken, R., Plass, J. L., & Leutner, D. (2003). Direct measurement of 
cognitive load in multimedia learning. Educational Psychologist, 38, 
53-61. http://dx.doi.org/10.1207/S15326985EP3801_7 

Buehner, M., Krumm, S., Ziegler, M., & Pluecken, T. (2006). Cognitive 
abilities and their interplay. Journal of Individual Differences, 27, 57-72. 
http://dx.doi.org/10.1027/1614-0001.27.2.57 

Bühner, M., Kröner, S., & Ziegler, M. (2008). Working memory, 
visual-spatial-intelligence and their relationship to problem-solving. 
Intelligence, 36, 672-680. http://dx.doi.org/10.1016/j.intell.2008.03.008 

Bull, R., & Lee, K. (2014). Executive functioning and mathematics 
achievement. Child Development Perspectives, 8, 36-41. http://dx.doi 
.org/10.1111/cdep.12059 

Carroll, W. M. (1994). Using worked examples as an instructional support 
in the algebra classroom. Journal of Educational Psychology, 86, 360-367. 
http://dx.doi.org/10.1037/0022-0663.86.3.360 

Chi, M. T. H., Bassok, M., Lewis, M. W., Reimann, P., & Glaser, R. 
(1989). Self-explanations: How students study and use examples in 
learning to solve problems. Cognitive Science, 13, 145-182. 
http://dx.doi.org/10.1207/s15516709cog1302_1 

Cohen, J. (1988). Statistical power analysis for the behavioral sciences 
(2nd ed.). Hillsdale, NJ: Erlbaum. 

Conway, A. R., Kane, M. J., Bunting, M. F., Hambrick, D. Z., Wilhelm, O., 
& Engle, R. W. (2005). Working memory span tasks: A methodological 
review and user’s guide. Psychonomic Bulletin & Review, 12, 769-786. 
http://dx.doi.org/10.3758/BF03196772 

Daneman, M., & Merikle, P. M. (1996). Working memory and language 
comprehension: A meta-analysis. Psychonomic Bulletin & Review, 3, 
422-433. http://dx.doi.org/10.3758/BF03214546 

de Jong, T. (2010). Cognitive load theory, educational research, and 
instructional design: Some food for thought. Instructional Science, 38, 
105-134. http://dx.doi.org/10.1007/s11251-009-9110-0 

de Jong, T., & Ferguson-Hessler, M. G. M. (1996). Types and qualities of 
knowledge. Educational Psychologist, 31, 105-113. http://dx.doi.org/10 
.1207/s15326985ep3102_2 

de Koning, B. B., Tabbers, H. K., Rikers, R. M. J. P., & Paas, F. (2010). 
Attention guidance in learning from a complex animation: Seeing is 
understanding? Learning and Instruction, 20, 111-122. http://dx.doi 
.org/10.1016/j.learninstruc.2009.02.010 

Diamond, A. (2013). Executive functions. Annual Review of Psychology, 
64, 135-168. http://dx.doi.org/10.1146/annurev-psych-113011-143750 

Engle, R. W., Tuholski, S. W., Laughlin, J. E., & Conway, A. R. A. (1999). 
Working memory, short-term memory, and general fluid intelligence: A 
latent-variable approach. Journal of Experimental Psychology: General, 
128, 309-331. http://dx.doi.org/10.1037/0096-3445.128.3.309 

Friedman, N. P., Miyake, A., Altamirano, L. J., Corley, R. P., Young, S. E., 
Rhea, S. A., & Hewitt, J. K. (2015). Stability and change in executive 
function abilities from late adolescence to early adulthood: A longitu- 
dinal twin study. Developmental Psychology. Advance online publica- 
tion. http://dx.doi.org/10.1037/dev0000075 

Friedman, N. P., Miyake, A., Corley, R. P., Young, S. E., Defries, J. C., & 
Hewitt, J. K. (2006). Not all executive functions are related to intelli- 
gence. Psychological Science, 17, 172-179. http://dx.doi.org/10.1111/j 
.1467-9280.2006.01681.x 

Friedman, N. P., Miyake, A., Robinson, J. L., & Hewitt, J. K. (2011). 
Developmental trajectories in toddlers’ self-restraint predict individual 
differences in executive functions 14 years later: A behavioral genetic 
analysis. Developmental Psychology, 47, 1410-1430. http://dx.doi.org/ 
10.1037/a0023750 

Friedman, N. P., Miyake, A., Young, S. E., Defries, J. C., Corley, R. P., & 
Hewitt, J. K. (2008). Individual differences in executive functions are 
almost entirely genetic in origin. Journal of Experimental Psychology: 
General, 137, 201-225. http://dx.doi.org/10.1037/0096-3445.137.2.201 

Friso-van den Bos, I., van der Ven, S. H. G., Kroesbergen, E. H., & van 
Luit, J. E. H. (2013). Working memory and mathematics in primary 
school children: A meta-analysis. Educational Research Review, 10, 
29-44. http://dx.doi.org/10.1016/j.edurev.2013.05.003 

Gathercole, S. E., & Alloway, T. P. (2007). Understanding working mem- 
ory: A classroom guide. London, UK: Harcourt Assessment. Retrieved 
from http://www.york.ac.uk/res/wml/Classroom%20guide.pdf 

Gimino, A. (2002). Students’ investment of mental effort. Paper presented 
at the annual meeting of the American Educational Research Associa- 
tion, New Orleans, LA. 

Hayes, A. F. (2012). PROCESS: A versatile computational tool for ob- 
served variable mediation, moderation, and conditional process model- 
ing [White paper]. Retrieved from http://www.afhayes.com/public/ 
process2012.pdf 

Hayes, A. F. (2013). Introduction to mediation, moderation, and condi- 
tional process analysis: A regression-based approach. New York, NY: 
Guilford Press. 

Hedden, T., & Yoon, C. (2006). Individual differences in executive pro- 
cessing predict susceptibility to interference in verbal working memory. 
Neuropsychology, 20, 511-528. 

Hiebert, J., & Lefevre, P. (1984). Conceptual and procedural knowledge in 
mathematics: An introductory analysis. In J. Hiebert (Ed.), Conceptual 
and procedural knowledge: The case of mathematics (pp. 1-29). Hills- 
dale, NJ: Erlbaum. 


Kalyuga, S. (2007). Expertise reversal effect and its implications for 
learner-tailored instruction. Educational Psychology Review, 19, 509- 
539. http://dx.doi.org/10.1007/s10648-007-9054-3 

Kalyuga, S., & Sweller, J. (2004). Measuring knowledge to optimize 
cognitive load factors during instruction. Journal of Educational Psy- 
chology, 96, 558-568. http://dx.doi.org/10.1037/0022-0663.96.3.558 

Kane, M. J., Conway, A. R. A., Hambrick, D. Z., & Engle, R. W. (2007). 
Variation in working memory capacity as variation in executive atten- 
tion and control. In A. Conway, C. Jarrold, M. Kane, A. Miyake, & J. 
Towse (Eds.), Variation in working memory (pp. 21-48). New York, 
NY: Oxford University Press. 

Kane, M. J., Hambrick, D. Z., Tuholski, S. W., Wilhelm, O., Payne, T. W., 
& Engle, R. W. (2004). The generality of working memory capacity: A 
latent-variable approach to verbal and visuospatial memory span and 
reasoning. Journal of Experimental Psychology: General, 133, 189- 
217. 

König, C. J., Bühner, M., & Mürling, G. (2005). Working memory, fluid 
intelligence, and attention are predictors of multitasking performance, 
but polychronicity and extraversion are not. Human Performance, 18, 
243-266. http://dx.doi.org/10.1207/s15327043hup1803_3 

Krumm, S., Schmidt-Atzert, L., Buehner, M., Ziegler, M., Michalczyk, K., 
& Arrow, K. (2009). Storage and non-storage components of working 
memory predicting reasoning: A simultaneous examination of a wide 
range of ability factors. Intelligence, 37, 347-364. http://dx.doi.org/10 
.1016/j.intell.2009.02.003 

Lee, K., Bull, R., & Ho, R. M. H. (2013). Developmental changes in 
executive functioning. Child Development, 84, 1933-1953. http://dx.doi 
.org/10.1111/cdev.12096 

Leppink, J., Paas, F., Van der Vleuten, C. P. M., Van Gog, T., & Van 
Merriënboer, J. J. G. (2013). Development of an instrument for measuring 
different types of cognitive load. Behavior Research Methods, 45, 
1058-1072. http://dx.doi.org/10.3758/s13428-013-0334-1 

Leppink, J., Paas, F., Van Gog, T., van der Vleuten, C. P. M., & Van 
Merriënboer, J. J. G. (2014). Effects of pairs of problems and examples 
on task performance and different types of cognitive load. Learning and 
Instruction, 30, 32-42. http://dx.doi.org/10.1016/j.learninstruc.2013.12 
.001 

Li, S.-C., Lindenberger, U., Hommel, B., Aschersleben, G., Prinz, W., & 
Baltes, P. B. (2004). Transformations in the couplings among intellec- 
tual abilities and constituent cognitive processes across the life span. 
Psychological Science, 15, 155-163. 

Lusk, M. M., & Atkinson, R. K. (2007). Animated pedagogical agents: 
Does their degree of embodiment impact learning from static or ani- 
mated worked examples? Applied Cognitive Psychology, 21, 747-764. 

Lusk, D. L., Evans, A. D., Jeffrey, T. R., Palmer, K. R., Wikstrom, C. S., 
& Doolittle, O. E. (2009). Multimedia learning and individual differ- 
ences: Mediating the effects of working memory capacity with segmen- 
tation. British Journal of Educational Technology, 40, 636-651. http:// 
dx.doi.org/10.1111/j.1467-8535.2008.00848.x 

Mischel, W., Ayduk, O., Berman, M. G., Casey, B. J., Gotlib, I. H., 
Jonides, J., . . . Shoda, Y. (2011). “Willpower” over the life span: 
Decomposing self-regulation. Social Cognitive and Affective Neuroscience, 
6, 252-256. http://dx.doi.org/10.1093/scan/nsq081 

Miyake, A., & Friedman, N. P. (2012). The nature and organization of 
individual differences in executive functions: Four general conclusions. 
Current Directions in Psychological Science, 21, 8-14. http://dx.doi 
.org/10.1177/0963721411429458 

Miyake, A., Friedman, N. P., Emerson, M. J., Witzki, A. H., Howerter, A., 
& Wager, T. D. (2000). The unity and diversity of executive functions 
and their contributions to complex “frontal lobe” tasks: A latent variable 
analysis. Cognitive Psychology, 41, 49-100. http://dx.doi.org/10.1006/ 
cogp.1999.0734 


Murphy, K. R., & Myors, B. (2004). Statistical power analysis: A simple 
and general model for traditional and modern hypothesis tests (2nd ed.). 
Mahwah, NJ: Erlbaum. 

Nigg, J. T. (2000). On inhibition/disinhibition in developmental psycho- 
pathology: Views from cognitive and personality psychology and a 
working inhibition taxonomy. Psychological Bulletin, 126, 220-246. 

Oberauer, K., Schulze, R., Wilhelm, O., & Süß, H. M. (2005). Working 
memory and intelligence—Their correlation and their relation: Com- 
ment on Ackerman, Beier, and Boyle (2005). Psychological Bulletin, 
131, 61-65. http://dx.doi.org/10.1037/0033-2909.131.1.61 

Opfermann, M. (2008). There’s more to it than instructional design: The 
role of individual learner characteristics for hypermedia learning. 
Berlin, Germany: Logos. 

Paas, F. (1992). Training strategies for attaining transfer of problem- 
solving skill in statistics: A cognitive-load approach. Journal of Educa- 
tional Psychology, 84, 429-434. http://dx.doi.org/10.1037/0022-0663 
.84.4.429 

Primi, R., Ferrao, M. E., & Almeida, L. S. (2010). Fluid intelligence as a 
predictor of learning: A longitudinal multilevel approach applied to 
math. Learning and Individual Differences, 20, 446-451. http://dx.doi 
.org/10.1016/j.lindif.2010.05.001 

Redick, T. S., Broadway, J. M., Meier, M. E., Kuriakose, P. S., Unsworth, 
N., Kane, M. J., & Engle, R. W. (2012). Measuring working memory 
capacity with automated complex span tasks. European Journal of 
Psychological Assessment, 28, 164-171. http://dx.doi.org/10.1027/ 
1015-5759/a000123 

Redick, T. S., Unsworth, N., Kelly, A. J., & Engle, R. W. (2012). Faster, 
smarter? Working memory capacity and perceptual speed in relation to 
fluid intelligence. Journal of Cognitive Psychology, 24, 844-854. http:// 
dx.doi.org/10.1080/20445911.2012.704359 

Rheinberg, F., Vollmeyer, R., & Burns, B. D. (2001). QCM: A question- 
naire to assess current motivation in learning situations. Diagnostica, 47, 
57-66. http://dx.doi.org/10.1026//0012-1924.47.2.57 

Schneider, W. J., & McGrew, K. (2012). The Cattell-Horn-Carroll model 
of intelligence. In D. Flanagan & P. Harrison (Eds.), Contemporary 
intellectual assessment: Theories, tests, and issues (pp. 99-144). New 
York, NY: Guilford Press. 

Schwonke, R., Renkl, A., Krieg, C., Wittwer, J., Aleven, V., & Salden, R. 
(2009). The worked-example effect: Not an artefact of lousy control 
conditions. Computers in Human Behavior, 25, 258-266. http://dx.doi 
.org/10.1016/j.chb.2008.12.011 

Schwonke, R., Wittwer, J., Aleven, V., Salden, R. J. C. M., Krieg, C., & 
Renkl, A. (2007). Can tutored problem solving benefit from faded 
worked-out examples? In S. Vosniadou, D. Kayser, & A. Protopapas 
(Eds.), Proceedings of EuroCogSci 07. The European Cognitive Science 
Conference 2007 (pp. 59-64). New York, NY: Erlbaum. 

Seufert, T., Schütze, M., & Brünken, R. (2009). Memory characteristics 
and modality in multimedia learning: An aptitude-treatment-interaction 
study. Learning and Instruction, 19, 28-42. http://dx.doi.org/10.1016/j 
.learninstruc.2008.01.002 

Sweller, J. (2011). Cognitive load theory. In J. Mestre & B. Ross (Eds.), 
The psychology of learning and motivation: Cognition in education 
(Vol. 55, pp. 37-76). Oxford, UK: Academic Press. 

van der Sluis, S., de Jong, P. F., & van der Leij, A. (2007). Executive 
functioning in children, and its relations with reasoning, reading, and 
arithmetic. Intelligence, 35, 427-449. http://dx.doi.org/10.1016/j.intell 
.2006.09.001 

Van Gerven, P. W. M., Paas, F., Van Merriënboer, J. J. G., & Schmidt, 
H. G. (2002). Cognitive load theory and aging: Effects of worked 
examples on training efficiency. Learning and Instruction, 12, 87-105. 
http://dx.doi.org/10.1016/S0959-4752(01)00017-2 

Van Gerven, P. W. M., Paas, F., Van Merriënboer, J. J. G., & Schmidt, 
H. G. (2004). Memory load and the cognitive pupillary response in 
aging. Psychophysiology, 41, 167-174. 
http://dx.doi.org/10.1111/j.1469-8986.2003.00148.x 

van Gog, T., Paas, F., & Van Merriënboer, J. J. G. (2006). Effects of 
process-oriented worked examples on troubleshooting transfer perfor- 
mance. Learning and Instruction, 16, 154-164. http://dx.doi.org/10 
.1016/j.learninstruc.2006.02.003 

van Gog, T., & Rummel, N. (2010). Example-based learning: Integrating 
cognitive and social-cognitive research perspectives. Educational Psychology 
Review, 22, 155-174. http://dx.doi.org/10.1007/s10648-010-9134-7 

Wecker, C., Rachel, A., Heran-Dörr, E., Waltner, C., Wiesner, H., & 
Fischer, F. (2013). Presenting theoretical ideas prior to inquiry activities 
fosters theory-level knowledge. Journal of Research in Science Teach- 
ing, 50, 1180-1206. http://dx.doi.org/10.1002/tea.21106 

Yeniad, N., Malda, M., Mesman, J., van Ijzendoorn, M. H., & Pieper, S. 
(2013). Shifting ability predicts math and reading performance in children: 
A meta-analytical study. Learning and Individual Differences, 23, 
1-9. http://dx.doi.org/10.1016/j.lindif.2012.10.004 

Yuan, K., Steedle, J., Shavelson, R., Alonzo, A., & Oppezzo, M. (2006). 
Working memory, fluid intelligence, and science learning. Educational 
Research Review, 1, 83-98. http://dx.doi.org/10.1016/j.edurev.2006.08 
.005 

Zepeda, C. D., Richey, J. E., Ronevich, P., & Nokes-Malach, T. J. 
(2015). Direct instruction of metacognition benefits adolescent sci- 
ence learning, transfer, and motivation: An in vivo study. Journal of 
Educational Psychology, 107, 954-970. http://dx.doi.org/10.1037/ 
edu0000022 

Ziegler, M., Danay, E., Heene, M., Asendorpf, J., & Bühner, M. (2012). 
Openness, fluid intelligence, and crystallized intelligence: Toward an 
integrative model. Journal of Research in Personality, 46, 173-183. 
http://dx.doi.org/10.1016/j.jrp.2012.01.002 


Appendix 


Tables for Correlations and Moderating Effects 
Table A3 


Correlations Among Variables 


Measure | Cognitive load | Fluid intelligence | Prior conceptual knowledge | Prior application-oriented knowledge | QCM scale anxiety | QCM scale probability of success | QCM scale challenge 
Cognitive load l = 30h i) O87 25" ee on 
Fluid intelligence me 1 18 ee —.18 oe —.09 
Prior conceptual knowledge eee 18 1 PAB ae Sue Dor —.20 
QCM scale anxiety .25* -.18 -.08 -.16 1 -.30** .16 
QCM scale probability of success ska .24* 260" 1350) eS Ons 1 .09 
Note. N = 76 for all correlations. QCM = Questionnaire of Current Motivation. 
*p < .05. **p < .01. ***p < .001. 


Table A4 


Conditional Effect of the Presence of Worked Examples on Acquisition of Application-Oriented Knowledge for Different Values of 
Prior Application-Oriented Knowledge 
Prior application-oriented knowledgeᵃ | Effect of the presence of worked examples | SE | t | p | 95% CI LL | 95% CI UL 
Note. Theoretical values of the prior application-oriented knowledge are between 0 and 6 points. Higher values indicate a higher prior application-oriented 
knowledge. Values between 0 and 1 are theoretical values for which the effect of the presence of worked examples on knowledge acquisition was calculated 
based on the sample data. SE = standard error; CI = confidence interval; LL = lower limit, UL = upper limit. 

ᵃ 53% of the participants had a prior application-oriented knowledge of 1 point. 91% of the participants had a prior knowledge of, at most, 3 points. Five 
participants had a total value of 4 points, one participant had a total value of 5, and one participant had a prior knowledge of 6 points. 
Table A5 
Conditional Effect of the Presence of Worked Examples on Acquisition of Application-Oriented Knowledge for Different Values of the 
Moderator Working Memory Capacity 
Working memory capacity   Effect of the presence of worked examples   SE   t   p   95% CI LL   95% CI UL 

.3310   .9542   .7130   1.3383   .1852   -.4682   2.3767 
.3648   .9384   .6605   1.4208   .1599   = 5192   2.2561 
.3987   .9227   .6096   1.5136   .1347   -.2934   2.1387 
.4325   .9069   .5607   1.6175   .1103   SPT)   2.0254 
.4663   .8911   .5144   1.7323   .0877   -.1351   1.9174 
.5002   .8754   .4715   1.8564   .0677   -.0653   1.8160 
.5340   .8596   .4330   1.9851   .0511   -.0043   1.7234 
.5366   .8584   .4303   1.9950   .0500   .0000   1.7167 
.5678   .8438   .4002   2.1085   .0386   .0455   1.6422 
.6017   .8280   .3745   DAT   .0303   .0809   1.5751 
.6355   .8123   .3575   2.2720   .0262   .0990   1.5255 
.6693   .7965   .3505   2.2725   .0262   .0973   1.4957 
.7032   .7807   .3540   2.2053   .0308   .0745   1.4870 
.7370   .7649   .3678   2.0798   .0413   .0312   1.4987 
.7553   .7564   .3792   1.9950   .0500   .0000   1.5128 
.7708   .7492   .3908   1.9172   .0594   -.0304   1.5287 
.8047   .7334   .4214   1.7404   .0862   -.1073   1.5741 
.8385   .7176   .4582   1.5663   .1218   -.1964   1.6316 
.8723   .7019   .4997   1.4045   .1647   -.2951   1.6988 
.9062   .6861   .5450   1.2590   ONS)   -.4011   1.7733 
.9400   .6703   .5931   1.1303   .2623   -.5128   1.8534 

Note. Theoretical values of working memory capacity are between 0 and 1. Higher values indicate a higher working memory capacity. Values below .26 
and above .94 were not realized in the sample. Values between .26 and .94 are theoretical values of working memory capacity, for which the effect of the 
presence of worked examples on knowledge acquisition was calculated based on the sample data. SE = standard error; CI = confidence interval; LL = 
lower limit; UL = upper limit. 
Table A6 


Conditional Effect of the Presence of Worked Examples on Acquisition of Application-Oriented Knowledge at Values of the 
Moderator Shifting 








Shifting (ms)   Effect of the presence of worked examples   SE   t   p   95% CI LL   95% CI UL 

53.1839   -.0832   .4491   -.1853   .8535   -.9789   .8124 
89.7237   .0710   .4163   .1706   .8650   = 595   .9013 
126.2635   .2252   .3876   .5811   .5631   -.5479   .9984 
162.8033   .3795   .3640   1.0425   .3008   -.3465   1.1054 
199.3431   .5337   .3464   1.5406   .1279   - Louie   1.2246 
231.9738   .6714   .3367   1.9944   .0500   .0000   1.3429 
235.8829   .6879   .3359   2.0481   .0443   .0180   1.3578 
272.4226   .8422   .3330   2.5291   .0137   .1779   1.5064 
308.9624   .9964   .3381   2.9472   .0044   .3221   1.6707 
345.5022   1.1506   .3507   3.2813   .0016   .4512   1.8500 
382.0420   1.3048   .3700   3.5263   .0007   .5668   2.0429 
418.5818   1.4591   .3952   3.6921   .0004   .6709   2.2472 
455.1216   1.6133   .4251   3.7952   .0003   .7655   2.4611 
491.6614   1.7675   .4588   3.8522   .0003   .8524   2.6826 
528.2012   1.9218   .4956   3.8775   .0002   .9333   2.9102 
564.7410   2.0760   .5348   3.8816   .0002   1.0093   3.1427 
601.2808   2.2302   .5760   3.8722   .0002   1.0815   3.3789 
637.8205   2.3844   .6186   3.8544   .0003   1.1506   3.6183 
674.3603   2.5387   .6625   3.8317   .0003   1.2173   3.8601 
710.9001   2.6929   .7075   3.8064   .0003   1.2819   4.1039 
747.4399   2.8471   .7532   3.7799   .0003   1.3449   4.3494 
783.9797   3.0014   .7997   3.7533   .0004   1.4065   4.5963 


Note. Shifting values are reaction times (in ms). Higher reaction times indicate a lower shifting ability. Shifting values below 53.18 ms and above 783.98 
ms were not realized in the sample. Values between 53.18 ms and 783.98 ms are theoretical values of the shifting ability, for which the effect of the presence 
of worked examples on knowledge acquisition was calculated based on the sample data. SE = standard error; CI = confidence interval; LL = lower limit; 
UL = upper limit. 
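
The columns of Table A6 are tied together by simple arithmetic, which is useful when checking individual cells. The following Python sketch is an illustration only, not the authors' analysis script: it recomputes t and the 95% confidence limits for one row from the reported effect and standard error. The critical value 1.9944 is an assumption read off the row in which p = .0500, and the function name is hypothetical.

```python
# Sketch: rebuild t and the 95% CI for one row of Table A6 from its effect and SE.
# T_CRIT is an assumption taken from the table row where p = .0500 (t = 1.9944).

T_CRIT = 1.9944


def conditional_effect_summary(effect, se, t_crit=T_CRIT):
    """Return (t, lower limit, upper limit) for a conditional effect."""
    t_value = effect / se          # test statistic: effect divided by its SE
    half_width = t_crit * se       # half-width of the confidence interval
    return t_value, effect - half_width, effect + half_width


# Row for a shifting value of 235.8829 ms: effect = .6879, SE = .3359.
t_value, lower, upper = conditional_effect_summary(0.6879, 0.3359)
print(f"t = {t_value:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Within rounding, the result agrees with the tabled values for that row (t = 2.0481, LL = .0180, UL = 1.3578).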
Table A7 
Conditional Effect of the Presence of Worked Examples on Acquisition of Application-Oriented Knowledge at Values of the 
Moderator Fluid Intelligence 





Fluid intelligence (raw score)   Effect of the presence of worked examples   SE   t   p   95% CI LL   95% CI UL 

-1.5700   2.5127   .8040   3.1253   .0026   .9088   4.1165 
-1.3350   2.3181   .7253   3.1958   .0021   .8711   3.7651 
-1.1000   2.1235   .6490   3.2719   .0017   .8288   3.4183 
-.8650   1.9289   .5759   3.3494   .0013   .7800   3.0778 
-.6300   1.7343   .5074   3.4182   .0011   .7221   2.7466 
-.3950   1.5398   .4456   3.4555   .0009   .6508   2.4287 
-.1600   1.3452   .3937   3.4165   .0011   .5597   2.1307 
.0750   1.1506   .3561   3.2309   .0019   .4402   1.8611 
.3100   .9560   .3376   2.8318   .0061   .2825   1.6296 
.5450   HOLS   .3413   2232   .0289   .0806   1.4423 
.6268   .6938   .3478   1.9950   .0500   .0000   1.3875 
.7800   .5669   .3665   1.5469   .1265   -.1642   1.2980 
1.0150   .3723   .4092   .9098   .3661   -.4441   1.1887 
1.2500   el   .4647   .3824   .7033   -.7494   1.1049 
1.4850   -.0168   .5290   -.0318   .9747   -1.0722   1.0385 
1.7200   -.2114   .5992   -.3528   .7253   -1.4068   .9840 
1.9550   -.4060   .6735   -.6028   .5486   -1.7496   .9376 
2.1900   -.6006   .7506   -.8001   .4264   -2.0981   .8969 
2.4250   7 Opi   .8299   -.9581   .3413   -2.4507   .8604 
2.6600   -.9897   .9106   -1.0868   .2809   -2.8064   .8270 
2.8950   -1.1843   .9926   -1.1932   .2369   -3.1644   .7958 
3.1300   -1.3789   1.0754   -1.2822   .2041   -3.5242   .7665 





Note. Higher raw scores indicate a higher fluid intelligence. Fluid intelligence values below —1.57 and above 3.13 were not realized in the sample. The 
values between —1.57 and 3.13 are theoretical values of fluid intelligence (except the value -.16, which was realized in the sample), for which the effect 
of the presence of worked examples on knowledge acquisition was calculated based on the sample data. SE = standard error; CI = confidence interval; 
LL = lower limit; UL = upper limit. 
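
The conditional effects in Table A7 change by a constant amount per step of the moderator, as would be expected if they come from a regression with a linear interaction term (an assumption here; the table itself does not state the model). The Python sketch below is a rough illustration under that assumption: it recovers the implied line from the first and last tabled rows and checks it against an intermediate row. The function name is hypothetical, and the intercept and slope are back-calculated from the table, not reported by the authors.

```python
# Sketch: recover the line implied by two rows of Table A7 and check a third row.
# Assumption: the conditional effect is linear in the moderator (fluid intelligence).


def line_through(point_a, point_b):
    """Return (intercept, slope) of the line through two (moderator, effect) points."""
    (m_a, e_a), (m_b, e_b) = point_a, point_b
    slope = (e_b - e_a) / (m_b - m_a)
    intercept = e_a - slope * m_a
    return intercept, slope


# First and last rows of Table A7: (raw score, conditional effect).
intercept, slope = line_through((-1.5700, 2.5127), (3.1300, -1.3789))

# Predicted conditional effect at a raw score of .3100 (tabled value: .9560).
predicted = intercept + slope * 0.3100
print(f"slope = {slope:.4f}, predicted effect at .3100 = {predicted:.4f}")
```

The point where this line crosses zero is only where the estimated effect changes sign. The boundary of statistical significance (the row with p = .0500) also depends on the standard error, which is why the table places it at a different moderator value.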
Developmental Trajectories of Academic Achievement in Chinese 
Children: Contributions of Early Social-Behavioral Functioning 
This study explored the developmental trajectories of academic achievement and the contributions of 
early social behaviors and problems to these trajectories in Chinese children. Data were collected each 
year in 5 consecutive years from a sample of elementary schoolchildren in China (initially N = 1,146, 
609 boys, initial M age = 8.33 years). Four distinct academic achievement trajectories were identified: 
low-stable, high/moderate-decreasing, high-increasing, and high-stable. Children high on sociability and 
low on externalizing behaviors and girls were more likely to be classified in the higher academic 
achievement trajectories. Initial higher levels of social competence were associated with lower decreasing 
rates of academic achievement within the high/moderate-decreasing trajectory. Initial lower levels of 
shyness and fewer externalizing behaviors predicted higher growth rates within the high-increasing 
trajectory. In addition, within the low-stable trajectory, children initially low on shyness and high on 
social-behavioral problems remained poor in academic achievement over time. The results suggest the 
significance of social-behavioral functioning in predicting the distinctive trajectories of academic 
achievement in Chinese children. 


Keywords: academic achievement, adjustment, developmental trajectories 


The attainment of academic achievement is one of the most 
important tasks for school-age children in Chinese society. It 
has been found that Chinese children outperform their counterparts 
in many other countries in academic areas throughout the elemen- 
tary and high school years (e.g., Stevenson, Chen, & Lee, 1993; 
Zhou, Main, & Wang, 2010). Whereas academic success is a major 
source of pride for the family, the child’s failure in academic 
achievement may bring disgrace and shame to parents and ances- 
tors (Ho, 1986). The Confucian doctrine of filial piety, for exam- 
ple, stipulates that children have the obligation to maintain and 
enhance the status and reputation of the family. In childhood and 
adolescence, this obligation is reflected mostly in school perfor- 
mance (e.g., Fuligni, Tseng, & Lam, 1999). Although the Chinese 
society has changed substantially over the past decades, many of 
the traditional beliefs and values, including those concerning academic 
achievement, have been retained in contemporary China. 
Children in Chinese schools face strong pressure to perform 
optimally on academic work; those who perform well are often 
praised by teachers and parents and respected by peers, but those 
who fail to meet the standard are regarded as problematic (Phil- 
lipson & Phillipson, 2007) and are likely to be criticized by adults 
and rejected by peers (Chen, Rubin, & Li, 1997; Phillipson & 
Phillipson, 2007; Zhou et al., 2010). 

Researchers have conducted a number of studies to examine the 
factors that contribute to academic success and failure in Chinese 
children (e.g., Chen, Huang, Chang, Wang, & Li, 2010; Liu, 
Bullock, & Coplan, 2014; Zhou et al., 2010). In general, the results 
indicate that socially competent and appropriate behaviors are 
associated with high academic achievement and that problem 
behaviors, such as disruptive and aggressive behaviors, are asso- 
ciated with academic difficulties. Children who display sociable 
and prosocial-cooperative behaviors are likely to receive assis- 
tance and support from others on schoolwork. In contrast, children 
who display aggressive-disruptive and other externalizing behav- 
iors are likely to create an undesirable environment for learning 
(e.g., Chang, 2004; Wentzel, 2005). 


Trajectories of Academic Achievement in 
Chinese Children 


The existing studies have provided valuable information about 
factors that are correlated with academic achievement among 
Chinese children. However, the majority of these studies were 
cross-sectional, and the existing longitudinal studies (e.g., Liu et 
al., 2014; Zhou et al., 2010) mostly included two-wave data, which 
did not allow for an examination of the growth patterns of indi- 
viduals over time. The impact of social and behavioral factors on 
academic achievement is likely to be continuous and long-term. 
To examine the developmental pathways of academic achievement 
and effects of social and behavioral factors on academic growth, at 
least three assessment points are needed. Therefore, we conducted 
this five-wave longitudinal study in Chinese school-age children to 
explore the developmental trajectories of academic achievement 
and contributions of early social-behavioral functioning to the 
trajectories. 

In the study of academic achievement, researchers have tradi- 
tionally treated Chinese children as a single group. This type of 
design is based on the assumption that a common developmental 
process holds for all individuals in the population. However, there 
is growing evidence indicating that children exhibit multiple pat- 
terns of academic development in childhood and adolescence (e.g., 
Hao & Woo, 2012; Hodis, Meyer, McClure, Weir, & Walkey, 
2011; Ladd & Dinella, 2009). Specifically, students typically vary 
on the initial level of academic achievement. Moreover, over time, 
academic performance may be stable among some students but 
increase or decrease among others. For example, Hodis and col- 
leagues (2011) identified two achievement trajectories in a diverse 
sample of high school students in New Zealand: students with a 
high initial level and slightly declining achievement and students 
with a lower initial level and steeply declining achievement. In a 
sample of adolescent children of immigrants in the United States, 
Hao and Woo (2012) identified four academic trajectories: high- 
fast growing, high-moderate growing, low-fast growing, and low- 
stable. 

Importantly, children in different academic achievement trajec- 
tories appear to be susceptible to the influence of different factors. 
For example, researchers found that increasing achievement was 
predicted by stronger cognitive abilities (Aunola, Leskinen, Lerk- 
kanen, & Nurmi, 2004) and better learning-related skills such as 
self-regulation (McClelland, Acock, & Morrison, 2006). In con- 
trast, students who had decreasing academic performance tended 
to have poor attendance rates at school (Hao & Woo, 2012) and 
low perceived self-efficacy (Caprara et al., 2008). Moreover, the 
same factor may have different effects on children in different 
trajectories (e.g., Chen, Hughes, & Kwok, 2014; Hodis et al., 
2011). A negative motivation orientation, for example, under- 
mined academic performance particularly for students who ini- 
tially fell behind their peers (Hodis et al., 2011). Taken together, 
these findings suggest that in order to fully understand the devel- 
opment of children’s academic achievement, it is important to take 
developmental heterogeneity into account and examine distinct 
academic trajectories and their predicting factors. 

Research on children’s academic trajectories has been con- 
ducted mostly in Western societies. Little is known about the 
trajectories of academic achievement among children in other 
societies. It is possible that academic trajectories show a distinct 
pattern in Chinese children. For example, a commonly reported 
trajectory of academic achievement in studies conducted in the 
United States is a low-increasing trajectory in which children had 
initially low achievement but displayed moderate or fast growth 
over time (e.g., Chen et al., 2014; Hao & Woo, 2012; McClelland 
et al., 2006). However, this may not be the case in Chinese 
children, due to the specific features of social and school contexts. 
In Chinese schools, main subjects such as Chinese language and 
mathematics are taught from lower to higher grades in a progressive 
manner with increasing difficulty and minimal repetition. 
Doing well in these subjects requires persistent effort and thorough 
understanding of the material in lower grades. Falling behind in 
these subjects usually makes it very difficult to catch up later. In 
addition, in Chinese schools, students’ academic achievement is 
often evaluated publicly, and those who underperform are some- 
times criticized and even humiliated by teachers and peers. Con- 
ceivably, underachieving students face great pressure, which may 
foster their negative attitudes toward academic work. Given the 
emphasis on academic achievement, Chinese children with poor 
academic performance are often disliked by peers (e.g., Chen et al., 
1997), preventing them from getting necessary support and assis- 
tance from others. These unfavorable school conditions for chil- 
dren with low academic achievement may make it particularly 
difficult for them to improve their performance over time. Thus, 
there may not be a low-increase academic trajectory evident 
among Chinese children as typically found among Western chil- 
dren. As such, it is important to conduct research to explore 
academic trajectories in Chinese children. 


Social-Behavioral Functioning and Academic 
Trajectories in Chinese Children 


Different aspects of social-behavioral functioning may have 
differential effects on the academic trajectories as well as within- 
trajectory growth among Chinese children. Researchers who are 
interested in children’s social-behavioral functioning typically fo- 
cus on socially competent (e.g., sociable-prosocial), externalizing 
(e.g., aggression, disruption), and internalizing (e.g., anxiety, so- 
cial withdrawal) behaviors (e.g., Morison & Masten, 1991; Rubin, 
Chen, McDougall, Bowker, & McKinnon, 1995). Consistent with 
the literature (e.g., Caprara et al., 2008), it has been found that 
Chinese children who display sociable and prosocial behaviors 
tend to perform well academically (Chen, Rubin, & Li, 1995). 
Thus, we expected that socially competent children would be more 
likely than socially incompetent children to have high initial aca- 
demic performance. Socially competent children may possess so- 
cial skills to approach peers for academic assistance in an appro- 
priate way, and their peers are likely to offer support to these 
children when they are in need (Jia et al., 2009). The positive 
reactions from peers help socially competent children maintain 
confidence to overcome obstacles, reduce academic stress, and 
develop positive attitudes toward schoolwork. Therefore, it is 
conceivable that socially competent children are also more likely 
than others to gain increasing academic achievement over time, 
and their improvement may be faster than that of socially incom- 
petent children. Even among children who have a decreasing 
academic trajectory, those who are relatively more competent in 
social areas may show a less dramatic decline. 

Recent research with Chinese children has indicated that early 
externalizing problem behaviors contributed significantly and neg- 
atively to subsequent academic failure and difficulties (Chen, Cen, 
Li, & He, 2005; Zhou et al., 2010). It is likely that externalizing 
behaviors create an environment that perpetuates low academic 
performance. Specifically, aggressive, disruptive and other exter- 
nalizing behaviors are typically perceived as highly problematic in 
China (Chang, 2004; Chen et al., 2010). Furthermore, children 
with externalizing behaviors tend to lack the ability to concentrate 
on work and regulatory skills to deal with frustration (e.g., 
Krueger, Caspi, Moffitt, White, & Stouthamer-Loeber, 1996), 
which may also inhibit academic progress. Therefore, externaliz- 
ing behaviors may lead to low and decreasing academic perfor- 
mance. Among the children who have a decreasing academic 
trajectory, the display of externalizing behaviors may exacerbate 
their difficulties, leading to a steeper declining trend. 

Similar to externalizing problems, internalizing problems have 
been found to have negative effects on academic achievement in 
Chinese children (e.g., Chen et al., 1995; Liu, Zhou, & Li, 2012). 
As the school year advances, academic tasks typically get more 
difficult. Children with internalizing problems are often highly 
sensitive and vulnerable to stress and distress, which makes them 
less able to cope with the increasing academic pressure. In addi- 
tion, in the Chinese society, group orientation and connectedness 
are valued, and individuals are expected to display regulated 
emotions to maintain social affiliation. Children who show height- 
ened anxiety, depression, and other internalizing symptoms may 
be perceived as “abnormal” and excluded from peer interactions. 
Therefore, it seems reasonable to expect that the academic 
achievement of children with higher levels of internalizing prob- 
lems may be lower and may decrease at an accelerating rate over 
time. As a salient internalizing social behavior, social withdrawal 
has received particular attention in the study of academic achieve- 
ment in childhood and adolescence (e.g., Chang et al., 2005; 
Morison & Masten, 1991). Shyness and unsociability are two main 
types of social withdrawal in Western and Chinese children (e.g., 
Chen, Wang, & Cao, 2011; Coplan & Armer, 2007); both may 
have significant implications for trajectories of academic achieve- 
ment. Whereas shyness represents an anxious reactivity to chal- 
lenging social situations, unsociability refers to the low tendency 
to participate in social interaction or a nonfearful preference for 
solitude. According to a conceptual model proposed by Asendorpf 
(1990), shyness is derived from a conflict between approach and 
avoidance motivations, indicating internal anxiety, fear, and lack 
of self-confidence. In contrast to shyness, unsociability is based on 
low levels of both approach and avoidance in social settings; 
unsociable children are characterized as lacking a strong desire to 
play with others although they may not actively avoid peer inter- 
action (Coplan & Weeks, 2010). Traditionally, shyness was valued 
in the Chinese society and was linked with academic achievement 
(e.g., Chen et al., 1995; Chen, Rubin, & Sun, 1992). However, as 
the Chinese society has become a more competitive, market- 
oriented society since the early 1990s, shyness has been found to 
be associated with increased academic difficulties (Chang et al., 
2005; Chen et al., 2005; Chen, Wang, & Wang, 2009). In com- 
parison to shyness, unsociability is associated with more pervasive 
and negative social evaluations in the Chinese society, because it 
is directly incompatible with the emphasis on group orientation 
and interpersonal connectedness. In the present study, we were 
interested in how shyness and unsociability would predict decreas- 
ing academic trajectories and inhibit the growth rate of academic 
achievement over time. 

Finally, research has consistently demonstrated negative effects 
of victimization on concurrent and later academic achievement 
(Liu et al., 2014; Schwartz, Gorman, Nakamoto, & Toblin, 2005). 
Compared to other peers, victimized children appear to be less 
capable of coping with school demands and concentrating on 
schoolwork, leading to poorer academic performance. Therefore, it 
is plausible that victimization, being detrimental to children’s 
academic performance, may predict low and decreasing performance, as well 
as a slower growth rate or steeper decline rate within a trajectory. 


Overview of the Present Study 


The primary purpose of the present longitudinal study was to 
examine growth trajectories of academic achievement and the 
contributions of early social-behavioral functioning to these tra- 
jectories in Chinese children. We focused on the period from 
second grade to sixth grade, which represents an important period 
of development in Chinese children as they engage in extensive 
social interactions and experience increased pressure on social and 
school performance (e.g., Chen et al., 2010). We included peer- 
assessed sociability-leadership, aggression, shyness and unsocia- 
bility in our study. To complement the peer ratings, we also asked 
the head teacher of each class to assess their students’ school- 
related competence, including social competence, victimization 
experiences, as well as externalizing and internalizing problems. 
These measures have been found to adequately capture students’ 
school-related behaviors and competence (e.g., Hightower et al., 
1986) and have important relations with Chinese children’s aca- 
demic achievement (e.g., Chen et al., 1997, 2010; Liu et al., 2014). 

Based on the discussion above, we first hypothesized that for 
students who start with high initial performance, there might be an 
increasing trajectory, a decreasing trajectory, and a stable trajectory. 
For students with low initial performance, low or decreasing 
trajectories might be more likely to emerge than an increasing 
trajectory. Second, we expected that children’s early social-behavioral 
functioning, including social competence, peer victimization, so- 
cial withdrawal, externalizing and internalizing problems would 
differentiate trajectories of academic performance. More specifi- 
cally, high social competence, low social withdrawal, and few 
externalizing and internalizing problems would predict high- 
performance category and promote academic growth. In contrast, 
low social competence and high levels of problematic behaviors 
and maladjustment would predict low-performance category and 
predict decreasing achievement. To our knowledge, this was the 
first study assessing in the urban Chinese context the contributions 
of early social behaviors and adjustment to identifying distinctive 
academic trajectories as well as to predicting academic growth 
within trajectories. We believe that this study would help us better 
understand, in today’s urban China, the significance of social 
behaviors and adjustment in predicting distinctive academic pro- 
files. 


Method 


Participants 


Participants were 1,146 second-grade children (609 boys) in 
ordinary elementary schools in Beijing (inside the 5th Ring Road), 
P. R. China. Unlike a few “key” schools, in which students were 
selected from different areas based on their school performance, 
students in ordinary schools came from the residential area in 
which the school is located. There were 30 classes, with approx- 
imately 40 students in each class. The initial mean age of children 
in this sample was 8 years 4 months (SD = 8 months). The core 
curriculum, including Chinese, mathematics, and English, is stipulated 
by the Ministry of Education in China. The structure and 
organization of elementary schools are similar across schools. 
Students are encouraged to participate in a variety of extracurric- 
ular social and academic activities in school, which provides 
extensive opportunities for peer interactions. One teacher is des- 
ignated to be in charge of a class. This head teacher often teaches 
one major course and takes care of the social and daily activities of 
the class. The schedule of courses and other activities is typically 
identical for students in the same class. Students spend roughly the 
same amount of time in the classroom. 

Almost all of the children (98%) were from intact families, and 
92% of them were only children whereas others had one or more 
siblings. The participants came from families with mostly low to 
middle socioeconomic status. Preliminary analyses indicated that 
family socioeconomic status and other demographic variables had 
nonsignificant effects on the variables or relations of interest in the 
study. 

We collected follow-up data on academic achievement near the 
end of each school year (May and June) in the same schools for 
Grades 3 to 6. These data were collected for 92.8% of the students 
from the original sample and 160 additional students who did not 
participate in the initial study. There were no significant differ- 
ences on the variables of interest between children who partici- 
pated in all waves and those who did not. 


Procedure 


In Grade 2, we group administered to the children a peer 
assessment measure of social behaviors. Teachers were requested 
to rate each participant concerning his or her school-related social 
competence, externalizing and internalizing problems, and victim- 
ization. Data concerning children’s academic achievement were 
obtained from school records for Grades 2 to 6. 

The members of our research team carefully examined the items 
in the measures, using a variety of strategies (e.g., repeated dis- 
cussion in the research group, interviews with children and teach- 
ers, psychometric analysis). The measures have proved valid and 
appropriate in Chinese as well as some other cultures (e.g., Chen 
et al., 2005). Extensive explanations of the procedure were pro- 
vided during administration. No evidence was found that the 
children had difficulties understanding the procedure or the items 
in the measures. The administration of all measures was carried out 
by a group of psychology teachers and graduate students at Peking 
University. The first wave of data was collected in 2002. The 
participation rate was approximately 95% at each time. Written 
assent was obtained from all the children and written consent was 
obtained from all the parents. 


Measures 


Academic achievement. Children’s academic achievement in 
Chinese, mathematics, and English was obtained through school 
records. The scores of the three subjects were based on objective 
examinations conducted by the school. The maximum score for 
each subject was 100, and a score of 60 is usually considered the 
cutoff between a pass and a failure in a course. Chinese, mathe- 
matics, and English were three major subjects taught in Chinese 
schools, and the aggregated score of the grades of these three 
subjects has been shown to be a valid measure of academic 
achievement in predicting relevant aspects of social, school, and 
psychological functioning (e.g., teacher-rated school competence 
and learning problems; self-perceptions of scholastic competence) 
in Chinese children (e.g., Chen et al., 1997, 2005, 2010). In the 
present study, scores on Chinese, mathematics, and English were 
significantly correlated (rs = .62 to .85, ps < .001). Consistent with 
the approach used in previous studies (e.g., Chen et al., 2010), 
scores on the three subjects were averaged and standardized to 
form a single index of academic achievement in the present study. 
Internal reliabilities of the measure based on grades for the three 
subjects were .81 to .88 for Grades 2 to 6. One-year stabilities were 
.74 to .84 from Grade 2 to Grade 6. 
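
The composite index described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `achievement_index` and the sample scores are hypothetical, and because the text does not say whether the population or sample standard deviation was used, the population form is assumed here.

```python
from statistics import mean, pstdev

def achievement_index(scores):
    """Average each child's subject scores, then z-standardize the averages.

    scores: list of (chinese, mathematics, english) tuples on the 0-100 scale.
    Returns one standardized index per child (mean 0, SD 1 across children).
    """
    averages = [mean(child) for child in scores]
    mu = mean(averages)
    sigma = pstdev(averages)  # population SD; an assumption, not stated in the text
    return [(a - mu) / sigma for a in averages]

# Hypothetical scores for four children (not data from the study).
sample = [(92, 88, 95), (75, 80, 70), (60, 65, 58), (85, 90, 88)]
index = achievement_index(sample)
```

A child whose three-subject average sits above the sample mean receives a positive index, and one below the mean a negative index, which is the scale on which the trajectory means are later reported.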

Peer assessments of social behaviors. We administered to 
the students a peer assessment measure of social behaviors, 
adapted from the Revised Class Play (Masten, Morison, & Pel- 
legrini, 1985). During administration, a research assistant read 
behavioral descriptors, and children were asked to nominate up to 
three classmates who could best play the role if they were to direct 
a class play. The total number of nominations each child received 
from all classmates was calculated and used to compute each item 
score for him or her. Children who received more nominations 
from the classmates for a role had higher scores on that item. 
Children who did not receive any nominations for an item received 
a score of zero. The item scores were standardized within the class 
(M = 0, SD = 1 for standardized scores) to adjust for differences 
in the number of nominators. The original Class Play measure 
consisted of items in broad areas of social functioning. The mea- 
sure in the present study consisted of items concerning sociability- 
leadership (e.g., “makes new friends easily,” “helps others when 
they need it”), aggression-disruption (e.g., “gets into a lot of 
fights,” “picks on other kids”), shyness-sensitivity (e.g., “very 
shy,” “feelings get hurt easily”), and unsociability (e.g., “rather play 
alone than with others,” “not interested in participating in activities 
with others”). Factor analysis indicated that the items represented 
the corresponding factors. Previous studies have shown that the 
measure is reliable and valid in Chinese children (see Chen et al., 
2011). In the present study, internal reliabilities were .96 for 
sociability, .91 for aggression, .69 for shyness, and .67 for unso- 
ciability. One-year stabilities were .70 to .88 for sociability, .78 to 
.87 for aggression, .60 to .74 for shyness, and .62 to .82 for 
unsociability from Grade 2 to Grade 6. 
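
The within-class standardization of nomination counts can be illustrated with a short sketch. The record layout, the function name `standardize_within_class`, and the example counts are all hypothetical, and the population standard deviation is assumed because the text does not specify which form was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_within_class(records):
    """records: list of (child_id, class_id, nominations_received).

    Returns {child_id: z-score computed within the child's own class}.
    """
    by_class = defaultdict(list)
    for child_id, class_id, count in records:
        by_class[class_id].append((child_id, count))

    z_scores = {}
    for members in by_class.values():
        counts = [count for _, count in members]
        mu, sigma = mean(counts), pstdev(counts)
        for child_id, count in members:
            # If every child in a class has the same count, define z as 0.
            z_scores[child_id] = 0.0 if sigma == 0 else (count - mu) / sigma
    return z_scores

# Hypothetical nomination counts for two small classes.
records = [
    ("a1", "A", 0), ("a2", "A", 3), ("a3", "A", 9),
    ("b1", "B", 0), ("b2", "B", 1), ("b3", "B", 2),
]
z = standardize_within_class(records)
```

Because each score is expressed relative to the child's own class, a child with 9 nominations in one class and a child with 2 nominations in another can end up with similar standardized scores, which is the adjustment for differing numbers of nominators that the paragraph describes.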

Teacher ratings. The head teacher in each class was requested 
to complete the Teacher-Child Rating Scale (adapted from 
Hightower et al., 1986 and Schwartz, Chang, & Farver, 2001). 
Teachers were asked to rate, on a 5-point scale, ranging from 1 
(not true at all) to 5 (very true), how well each item described the 
child. This measure includes several subscales tapping school- 
related competence and problems, including social competence 
(e.g., “participates in class discussion,” “comfortable as a leader”), 
externalizing problems (e.g., “overly aggressive to peers [fights],” 
“disruptive in class”), internalizing problems (e.g., “nervous, 
frightened, tense,” “anxious, worried”), and victimization (e.g., 
“other children pick on this child,” “other children hit or push this 
child”). Factor analysis indicated four factors with the items loaded 
on the corresponding factor. The total scores on each subscale 
were computed and standardized within the class to adjust for 
teacher response styles and to allow for appropriate comparisons. 
The variables of externalizing and internalizing problems were 
labeled as “acting out or aggression/acting-out” and “shy-anxious” 
in Hightower et al. (1986) and some other studies (e.g., Chen et al., 
2010). The items for these variables tapped broader constructs than 
aggression and shyness in peer assessments. To avoid potential 
confusion, we used the terms of externalizing and internalizing 
problems in this study, which was consistent with the approach in 
previous studies (e.g., Chen, Yang, & Wang, 2013; Rubin et al., 
1995). The Teacher-Child Rating Scale has proved to be reliable 
and valid in Chinese children (e.g., Chen et al., 1997). In the 
present study, internal reliabilities were .94 for competence, .81 for 
externalizing problems, .78 for internalizing problems, and .77 for 
victimization. One-year stabilities were .46 to .54 for competence, 
.44 to .55 for externalizing problems, .30 to .36 for internalizing 
problems, and .30 to .43 for victimization. 


Results 


Descriptive Data 


A full-information maximum likelihood estimation was used to 
handle the missing data for students who had incomplete data on 
the variables (e.g., Graham, 2009). Little’s missing-completely-at- 
random test indicated that all variables in this study were missing 
completely at random, χ²(395) = 373.78, p > .05. A multivariate 
analysis of variance (MANOVA) was conducted to examine the 
overall effect of gender on all the variables. Significant effects of 
gender were found in all variables except for teacher-rated internalizing 
problems, Wilks’ λ = .78, F(13, 1132) = 25.21, p < .01. 
Follow-up univariate analyses revealed that girls had better academic 
performance than boys in Grades 2 to 6 (η² ranging from 
.02 to .05, p < .01). In addition, girls had higher scores than boys on 
peer- and teacher-assessed sociability and on peer-assessed shyness and 
unsociability in Grade 2 (η² ranging from .03 to .07, p < 
.01). Boys had higher scores on peer-assessed aggression and on 
teacher-assessed victimization and externalizing problems (η² ranging 
from .04 to .14, p < .01). Gender was included as a predictor in the 
subsequent growth mixture modeling (GMM). 

Following our research questions, the results are presented in 
three parts. First, we described different developmental trajectories 
of academic achievement, after the optimal number of latent 
classes was determined. Second, we examined the effects of peer- 
assessed and teacher-rated social-behavioral variables in Grade 2 
in predicting latent class membership. Third, we tested these 
variables as predictors of the initial status and growth rate of 
academic achievement within each trajectory. 


Table 1 

Fit Statistics for Growth Mixture Models 

Class   Log-likelihood   BIC        aBIC       AIC
1       -3,921.66        7,911.89   7,880.13   7,863.31
2       -2,858.66        5,854.47   5,790.95   [illegible]
3       -2,658.66        5,488.75   5,409.35   5,367.31
4       -2,523.71        5,266.86   5,165.23   5,111.42
5       -2,393.15        5,074.32   4,940.93   4,870.30
6       -2,318.55        4,993.69   4,828.54   4,741.09


Trajectories of Academic Achievement 


Based on the linear growth model, a series of GMMs from one to 
six classes was tested and the fit indexes of each model were com- 
pared (see Table 1). Because GMMs with different classes were not 
nested, model fit comparisons were obtained from information in- 
dexes including the Bayesian information criterion (BIC), sample- 
size-adjusted BIC, and Akaike information criterion, which take into 
account the model log-likelihood while penalizing for model com- 
plexity (see Nylund, Asparouhov, & Muthén, 2007). Lower scores on 
these information indexes indicate a better fitting model. The Vuong- 
Lo-Mendell-Rubin likelihood ratio test (LRT), Lo-Mendell-Rubin 
LRT, and bootstrapped LRT (BLRT) generally apply a corrected 
likelihood-ratio distribution to compare models with c and c − 1 
unobserved groups (Muthén, 2004; Nylund et al., 2007). A statistically 
significant p value suggests the c − 1 class model should be 
rejected in favor of the current c class model (Feldman, Masyn, & 
Conger, 2009). Another indicator, entropy, summarizes the accuracy 
with which cases are classified into the extracted 
latent classes (Lubke & Muthén, 2007; Muthén, 2004). The 
value of entropy ranges from 0 to 1, with high values closer to 1 
indicating higher classification accuracy. However, the optimal num- 
ber of different classes should be determined by a combination of 
factors including acceptable fit indexes and tests, successful conver- 
gence, no less than 1% of total count in each class, and high posterior 
probabilities (Jung & Wickrama, 2008). Recent studies have shown 
that among all the fit indexes and tests, the BLRT performs the best, 
followed by BIC in determining the optimal number of classes in a 
growth mixture model (e.g., Nylund et al., 2007). As shown in Table 
1, compared to the models with three and five classes, a significant 
value of BLRT and a smaller value of BIC suggested that the GMM 
with four classes should be chosen, as it best balanced goodness-of-fit, 
model parsimony, and interpretability. 
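
To make the penalty for model complexity concrete, the three information indexes can be computed from a log-likelihood as sketched below. The formulas are the standard definitions of AIC, BIC, and the sample-size-adjusted BIC. The inputs are partly assumed: the log-likelihood is the one-class value from Table 1 and n = 951 is the sum of the four class sizes reported in this section, but the number of free parameters (10) is not reported in the text and is only a guess for illustration.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC, sample-size-adjusted BIC) for a fitted model."""
    deviance = -2.0 * log_likelihood
    aic = deviance + 2 * n_params
    bic = deviance + n_params * math.log(n_obs)
    abic = deviance + n_params * math.log((n_obs + 2) / 24)
    return aic, bic, abic

# One-class model from Table 1: log-likelihood = -3,921.66.
# n_obs = 951 is the sum of the four reported class sizes.
# n_params = 10 is an assumption; the text does not report it.
aic, bic, abic = information_criteria(-3921.66, n_params=10, n_obs=951)
```

Under these assumptions the three values land within rounding distance of the one-class row of Table 1, which suggests the table follows these definitions. All three indexes start from the same deviance and differ only in how heavily each additional parameter is penalized.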

After the optimal number of latent classes was determined, the 
four-class GMM model was estimated with covariates of interest 
(i.e., peer-reported and teacher-rated social behaviors and adjust- 
ment) to further test the predicting effects of covariates on class 
membership probabilities and on the initial status and growth rate 
within each trajectory. Because an unconditional model (i.e., with- 
out covariates) may lead to distorted results analogous to a mis- 
specified regression model, it is crucial to include covariates to 
offer auxiliary information needed for a more precise classifica- 
tion. Therefore, a conditional GMM predicts an individual’s esti- 
mated probability of class membership and this individual’s 
growth trend, relative to the mean trajectory in each class (e.g., 


Table 1 (continued)

Class   VLMR-LRT   LMR-LRT   BLRT    Entropy
1       N/A        N/A       N/A     N/A
2       <.001      <.001     <.001   .82
3       <.001      <.001     <.001   .84
4       <.001      <.001     <.001   [illegible]
5       <.001      <.001     >.05    .13
6       >.05       >.05      >.05    [illegible]

Note. BIC = Bayesian information criterion; aBIC = sample-size adjusted BIC; AIC = Akaike information criterion; VLMR-LRT = Vuong-Lo- 
Mendell-Rubin likelihood ratio test; LMR-LRT = Lo-Mendell-Rubin likelihood ratio test; BLRT = Bootstrap likelihood ratio test. 
Feldman et al., 2009; Muthén, 2004; Reinecke & Seddig, 2011). 
When peer- and teacher-assessed covariates were included in the 
current study and allowed to have varying influence in different 
classes, the four-class solution remained stable (e.g., significantly 
lower values of BIC and log-likelihood; entropy > .7; changes in 
the classification proportions are less than 1%; e.g., Muthén, 2004; 
Reinecke & Seddig, 2011). 
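
The entropy criterion referred to above can be illustrated with the relative entropy statistic commonly reported for mixture models. The text does not print the formula, so the definition used here (one minus the summed posterior uncertainty divided by its maximum, n ln K) is an assumption about what was computed, and the posterior probabilities below are made up.

```python
import math

def relative_entropy(posteriors):
    """posteriors: one list of class-membership probabilities per case.

    Returns 1 - (sum of -p*ln(p) over cases and classes) / (n * ln(K)).
    """
    n_cases = len(posteriors)
    n_classes = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # p*ln(p) tends to 0 as p approaches 0
                total += -p * math.log(p)
    return 1.0 - total / (n_cases * math.log(n_classes))

# Hypothetical posteriors for a four-class model.
crisp = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]      # every case clearly assigned
fuzzy = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]  # no information at all
```

Perfectly separated cases give a value of 1 and completely uninformative posteriors give a value of 0, so a threshold such as .7 asks for posteriors that are much closer to the first situation than the second.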

Four distinct trajectories of academic achievement were identi- 
fied and the corresponding model-estimated mean trajectories are 
depicted in Figure 1. The trajectories included (a) a low-stable 
trajectory (224 children, 24% of the sample, initial M = -.52, 
approximately 65 on the 0-100 scale): This trajectory started low 
in Grade 2 and remained at the lowest level throughout the 5-year 
period; (b) a high/moderate-decreasing trajectory (251 children, 
26% of the sample, initial M = .32, 79 on the 0-100 scale): 
Children in this trajectory showed a moderately high level of 
academic achievement in Grade 2 but decreased thereafter 
throughout the 5-year period; (c) a high-increasing trajectory (343 
children, 36% of the sample, initial M = .38, 81 on the 0-100 
scale): Their academic achievement was high in Grade 2 (i.e., 
higher than the decreasing class but lower than the high-stable 
class) and exhibited subsequent growth; and (d) a high-stable 
trajectory (133 children, 14% of the sample, initial M = .46, 82 on 
the 0-100 scale): This trajectory started the highest in Grade 2 and 
remained stable throughout the 5-year period. The latent means 
and standard errors of covariates are presented in Figure 2. 


The Effects of Early Social-Behavioral Variables on 
Latent Class Membership 


The effects of covariates of social behaviors and adjustment on 
latent class membership are estimated by a multinomial logit model 
and the results are shown in Table 2. First, with the current four-class 
solution, the low-stable class was designated as a reference group. The 
log odds of being in the high/moderate-decreasing, high-increasing, 
and high-stable classes and the extent to which each covariate distin- 
guishes class membership were compared with this reference group. 
Results showed that peer-assessed sociability differentiated the low- 
stable class from the other three classes. Children with a higher initial 
level of sociability in Grade 2 were less likely to be in the low-stable 
class than in the other three classes (estimates = 1.14, 1.38, and 1.48, 
SEs = .38, .33, and .34, odds ratios [ORs] = 3.12, 3.97, and 4.40). In 
contrast, children who exhibited more externalizing behavior prob- 
lems were more likely to be classified in the low-stable class than any 
other trajectory (estimates = -.35, -.33, and -.43, SEs = .17, .15, 
and .21, ORs = .83, .72, and .65). In addition, children with higher 
teacher-rated social competence in Grade 2 were more likely to be in 
the high-increasing class than in the low-stable class (estimate = .51, 
SE = .16, OR = 1.66). Second, the high-increasing class was 
chosen as the reference group. Teacher-rated social competence dif- 
ferentiated this group from other classes. Finally, using the high/ 
moderate-decreasing class as the reference group, gender differenti- 
ated the high-stable class from this reference class. Compared to boys, 
girls had a greater likelihood of being in the high-stable class than in 
the high/moderate-decreasing class. 
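
The odds ratios reported in this section follow from the multinomial logit estimates by exponentiation. The sketch below applies that conversion to the three sociability estimates quoted above; the helper name `odds_ratio` and the dictionary layout are hypothetical.

```python
import math

def odds_ratio(logit_estimate):
    """Convert a multinomial-logit coefficient into an odds ratio."""
    return math.exp(logit_estimate)

# Peer-assessed sociability, each class vs. the low-stable reference class
# (estimates as quoted in the text).
estimates = {
    "high/moderate-decreasing": 1.14,
    "high-increasing": 1.38,
    "high-stable": 1.48,
}
odds_ratios = {name: round(odds_ratio(b), 2) for name, b in estimates.items()}
```

The converted values agree with the reported odds ratios (3.12, 3.97, and 4.40) to within about 0.01, with the small differences presumably reflecting rounding of the published estimates. Read this way, a one-standard-deviation increase in sociability multiplies the odds of belonging to the comparison class rather than the low-stable class by roughly three to four.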


The Effects of Early Social-Behavioral Variables 
on the Initial Status and Growth Rate 
Within Trajectories 


As GMMs essentially assume that the population of interest 
consists of heterogeneous subpopulations with varying parameters, 
covariates can be used to predict individual differences of the 
initial status and growth rates within each class. In the present 
model, the intercept and slope of academic achievement within the 
trajectory were regressed on peer-assessed and teacher-rated social 
behaviors and adjustment. As shown in Table 3, all the covariates 
except for peer-assessed aggression in Grade 2 were significant 
predictors of within-class intercepts or slopes. 
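
One common way to write a conditional linear growth mixture model of this kind is sketched below. The notation is an assumption for illustration, since the text does not print its equations: $y_{it}$ is child $i$'s achievement at wave $t$, $a_t$ is the time score (for example, 0 to 4 for Grades 2 to 6), $c_i$ is the latent trajectory class with $K$ categories, and $\mathbf{x}_i$ is the vector of Grade 2 covariates.

```latex
\begin{aligned}
y_{it} \mid (c_i = k) &= \eta_{0i} + \eta_{1i}\, a_t + \varepsilon_{it}, \\
\eta_{0i} \mid (c_i = k) &= \alpha_{0k} + \boldsymbol{\gamma}_{0k}^{\top}\mathbf{x}_i + \zeta_{0i}, \\
\eta_{1i} \mid (c_i = k) &= \alpha_{1k} + \boldsymbol{\gamma}_{1k}^{\top}\mathbf{x}_i + \zeta_{1i}, \\
P(c_i = k \mid \mathbf{x}_i) &=
  \frac{\exp\!\left(\delta_{0k} + \boldsymbol{\delta}_{k}^{\top}\mathbf{x}_i\right)}
       {\sum_{j=1}^{K}\exp\!\left(\delta_{0j} + \boldsymbol{\delta}_{j}^{\top}\mathbf{x}_i\right)} .
\end{aligned}
```

In this sketch the class-specific vectors $\boldsymbol{\gamma}_{0k}$ and $\boldsymbol{\gamma}_{1k}$ carry the within-class effects of the covariates on the intercept and slope, while the last line is the multinomial logit for class membership, with the coefficients of the reference class fixed at zero.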

Among children classified in the high/moderate-decreasing tra- 
jectory, those with higher teacher-rated social competence had a 
higher initial level of academic achievement. Their academic 
achievement also decreased to a lesser extent than others in this 
class, suggesting that social competence may serve as a protective 
factor for preventing decrease in academic success. For children in 
the high-increasing trajectory of academic achievement, higher 
Figure 1. Developmental trajectories of academic achievement estimated from the four-class growth mixture 
model. See the online article for the color version of this figure. 

Figure 2. Means and standard errors (error bars) of covariates of four trajectories. See the online article for the 
color version of this figure. 


peer-assessed sociability and shyness and teacher-rated social 
competence and lower unsociability were associated with a higher 
initial status of academic achievement. Throughout the 5-year 
period, children who displayed higher shyness and more external- 
izing problems tended to have a lower increase rate in academic 
growth. For children in the low-stable class, where individual 
academic achievement trajectory started and remained at the low- 
est level over time, higher peer-assessed shyness and teacher-rated 
social competence predicted relatively higher initial achievement 
in academic subjects. On the contrary, higher unsociability, more 


internalizing problems and more victimization experiences pre- 
dicted lower initial status of academic achievement within this 
category. For children in the high-stable class, only teacher-rated 
victimization was marginally significant in predicting the initial 
status of academic achievement. 


Discussion 


In this five-wave longitudinal study, we investigated academic 
trajectories and their social-behavioral predictors among Chinese 


Table 2 
The Association Between Trajectory Class Membership and Social-Behavioral Variables 

Columns (each reporting OR, Est., and SE): 
  High/moderate-decreasing (vs. low-stable) 
  High-increasing (vs. low-stable) 
  High-stable (vs. low-stable) 
  High/moderate-decreasing (vs. high-increasing) 
  High-stable (vs. high-increasing) 
  High-stable (vs. high/moderate-decreasing) 

Rows (covariates): 
  Sex 
  Peer assessments: Sociability, Aggression, Shyness, Unsociability 
  Teacher assessments: Social comp., Ext. prob., Int. prob., Victimization 

[Cell values and significance markers are not legible in this copy.] 

Note. OR = odds ratio; comp. = competence; Ext. = externalizing; prob. = problems; Int. = internalizing. 
FU, CHEN, WANG, AND YANG 

Table 3 
Parameter Estimates for the Four-Class GMM With Social-Behavioral Variables 

Parameter                 High/moderate-decreasing   High-increasing   High-stable    Low-stable 

Predicting intercept/initial status 
  Sex                     -.01 (.03)                  .00 (.03)        -.02 (.02)     -.04 (.15) 
  Peer assessments 
    Sociability           -.02 (.02)                  .04 (.02)*        .01 (.01)      .31 (.20) 
    Aggression            -.02 (.02)                  .01 (.03)         .01 (.01)      .05 (.08) 
    Shyness               -.01 (.03)                  .08 (.03)*       -.03 (.02)      .25 [a] 
    Unsociability          .01 (.04)                 -.08 (.04)*        .02 (.02)     -.42 (.12)** 
  Teacher assessments 
    Social comp.           .07 (.03)*                 .05 (.02)*        .01 (.01)      .34 (.08) [b] 
    Ext. prob.             .02 (.02)                  .01 (.02)         .01 (.02)      .05 (.05) 
    Int. prob.            -.01 (.03)                  .01 (.02)         .00 (.01)     -.15 (.07)* 
    Victimization          .01 (.02)                 -.07 (.04)        -.02 (.01)*    -.20 (.06)** 

Predicting slope/growth rate 
  Sex                     -.01 (.03)                 -.01 (.01)         .01 (.01)      .14 (.05)* 
  Peer assessments 
    Sociability           -.01 (.02)                  .01 (.01)         .01 (.01)     -.02 (.09) 
    Aggression             .01 (.02)                  .01 (.01)        -.01 (.01)      .02 (.03) 
    Shyness               -.01 (.03)                 -.02 (.01)*        .02 (.02)     -.01 (.05) 
    Unsociability          .01 (.03)                  .01 (.01)        -.02 (.02)      .02 (.05) 
  Teacher assessments 
    Social comp.           .05 (.02)*                 .01 (.01)         .01 (.01)     -.05 (.03) 
    Ext. prob.            -.02 (.02)                 -.02 (.01)*       -.01 (.01)     -.01 (.03) 
    Int. prob.             .02 (.02)                  .01 (.01)         .01 (.01)      .03 (.02) 
    Victimization         -.02 (.03)                  .01 (.01)         .01 (.01)      .03 (.03) 

Note. GMM = growth mixture modeling; comp. = competence; Ext. = externalizing; prob. = problems; Int. = internalizing. SEs are in parentheses after 
estimates. [a] Standard error and significance marker not legible in this copy. [b] Significance marker not legible in this copy. 

children. Four academic trajectories emerged: low-stable, high/ 
moderate-decreasing, high-increasing, and high-stable, differentiated 
mainly by early social competence and externalizing behaviors. 
In addition, social competence, externalizing behaviors, 
internalizing behaviors, social withdrawal, and victimization ex- 
periences differentially predicted the initial status and growth 
within trajectories. These findings revealed the heterogeneous 
patterns of development of academic achievement and the role of 
early social-behavioral functioning in predicting different aca- 
demic trajectories in Chinese children. 


Distinct Academic Trajectories Among 
Chinese Children 


Four trajectories were identified in the study: high-stable, high- 
increasing, high/moderate-decreasing, and low-stable. We did not 
find a low-increasing class, which was different from the results 
typically found in Western samples (e.g., Hao & Woo, 2012; Ladd 
& Dinella, 2009). As indicated earlier, in the Chinese school 
systems, children with low academic performance may face seri- 
ously unfavorable conditions, such as increasing difficulty in core 
curriculum, heightened academic pressure, and lack of social support 
and assistance. These factors make it very difficult for low-achieving 
children to substantially improve their academic performance over 
time. It should be noted that the lack of a low-increasing class may not 
be the only difference between Chinese and Western children in their 
academic development. It will be important to conduct further re- 
search in Chinese and Western children in order to achieve a more 
complete understanding of academic trajectories of children in differ- 
ent societies. 


Our results showed that gender differentiated the trajectories of 
academic achievement. Compared to boys, girls were more likely 
to display high-achieving developmental patterns (i.e., high- 
increasing and high-stable trajectories). These results were consis- 
tent with existing findings that girls tend to perform better than 
boys academically (e.g., Chen et al., 2013; Lam et al., 2012). 
Among the social-behavioral predictors, peer-assessed sociability, 
teacher-rated social competence, and externalizing problems pre- 
dicted distinct academic trajectories. Children who exhibited high 
sociability and social competence in the early years were more 
likely to be in the high-increasing and high-stable trajectories than 
in the high/moderate-decreasing and low-stable trajectories. So- 
cially competent children are more likely than others to gain 
emotional and social resources that are conducive to learning and 
academic achievement. Thus, socially competent behaviors help 
these children maintain or promote their high academic achieve- 
ment over time. 

In contrast, externalizing problem behaviors particularly 
characterized a stabilized pattern of the low-achieving group. 
Research has indicated that aggressive, disruptive, and other 
externalizing behaviors are related to significant academic 
deficiencies in Chinese children (Chen et al., 2010; Zhou et al., 
2010). Aggressive and externalizing behaviors are perceived as 
highly problematic and children who display these behaviors 
are often rejected by peers, which may prevent them from 
getting academic support. Thus, it is not surprising that children 
with early externalizing problems were likely to be in the 
low-stable trajectory. Taken together, our results suggested that 
early social competence and behavioral problems play significant 
roles in distinguishing and defining academic trajectories 
among Chinese children. 

We focused on the academic achievement trajectories during the 
elementary school years and these trajectories may impact aca- 
demic performance in the later years. Children identified in the 
four trajectories are likely to continue to develop in the same 
growth patterns. For example, it is argued that desirable conditions 
may serve to potentiate personal capacities and enhance adaptive 
development, whereas adverse conditions may suppress or inhibit 
adaptive development (Kupersmidt, Griesler, DeRosier, Patterson, 
& Davis, 1995). Thus, high and increasing patterns of academic 
achievement in elementary schools are beneficial to the acquisition 
of support from others, which may promote high-achievers’ con- 
fidence and participation in academic learning in the later years. In 
contrast, low and decreasing patterns of academic achievement in 
elementary schools are likely to undermine individual learning and 
involvement in academic activities, leading to continuous aca- 
demic difficulties in high schools. Moreover, given the current 
findings on the contributions of early social-behavioral functioning 
to academic trajectories in middle to late childhood, it will be 
interesting to examine how early social behaviors and problems 
predict academic development in adolescence. 


Social-Behavioral Predictors of the Initial Status and 
Growth Rate Within Trajectories 


Our results also revealed that social-behavioral variables have 
differential effects on the initial status and growth rate of academic 
achievement within different trajectories. Specifically, early 
teacher-rated social competence was positively associated with 
relatively higher initial levels of academic achievement for chil- 
dren in the high/moderate-decreasing, high-increasing, and low- 
stable trajectories. Teacher-rated social competence also predicted 
a lower rate of academic decline for children in the high/moderate- 
decreasing trajectory, suggesting that early social competence 
serves as a protective factor in the decline of academic achieve- 
ment. This protective function of social competence is likely to 
involve both interpersonal and intrapersonal processes. Socially 
competent children are skilled at forming and maintaining positive 
classroom interactions that lead to instrumental support, which in 
turn helps them handle their learning problems (e.g., Chen et al., 
2013). Moreover, the favorable social evaluations that socially 
competent children receive may help them maintain a positive 
school attitude and enhance their confidence in coping with aca- 
demic frustration and failures (Jia et al., 2009). 

Our results indicated that in contrast to social competence, 
social-behavioral problems predicted lower initial levels or lower 
growth rates of academic achievement. For example, externalizing 
behaviors predicted less academic growth for children in the 
high-increasing trajectory. That is, although children in this group 
initially performed well on academic tasks, those who displayed 
externalizing behaviors had relatively less growth in the later 
years. It should be noted that, on average, children in this group 
displayed fewer externalizing problems, especially compared to 
children in the high/moderate-decreasing and low-stable trajecto- 
ries. The externalizing behaviors displayed by children in this 
group might be relatively less severe and extensive, which might 
not result in major learning problems but still inhibit academic 
growth. To become academically stronger, children need to continue 
to discuss, cooperate, and learn from other competent peers. 
Even moderate levels of aggressive and disruptive behavior might 
adversely affect these activities, which might be particularly the 
case in China where group harmony is highly valued (Chen & 
French, 2008). In addition, to make continuous academic progress, 
children need to deal with academic difficulties effectively. Exter- 
nalizing behaviors may prevent them from concentrating on the 
schoolwork and regulating their emotional distress to overcome 
academic obstacles (Zhou et al., 2010). Therefore, although chil- 
dren in the high-increasing group generally perform well, exter- 
nalizing behaviors serve to attenuate their academic growth. 

Unsociability predicted lower initial achievement among children 
in both high-increasing and low-stable trajectories. The results were 
consistent with the literature indicating the negative effect of unso- 
ciability on academic achievement among Chinese children (e.g., 
Chen et al., 2011; Liu, Coplan et al., 2014). Interestingly, shyness was 
associated with better initial academic achievement for children in 
these two trajectories, which seems to support the view that unsocia- 
bility and shyness are distinct types of social-withdrawal that have 
different functional meanings (Chen et al., 2011). The early positive 
association between shyness and academic achievement does not 
imply that shyness necessarily facilitates academic growth. In fact, a 
higher level of shyness predicted lower achievement growth rate, 
particularly among children in the high-increasing trajectory. These 
results need to be understood in the Chinese context. In Chinese 
schools, children who are high achievers initially benefit from oppor- 
tunities and resources to make academic progress in the later years. 
For example, these children are likely to be elected by peers and 
teachers as class or school leaders, encouraged to participate in 
academic competitions, and invited to share their ideas in class, which 
are important for them to further enhance their achievement. How- 
ever, these opportunities often involve group-level interactions and 
possibly evoke feelings of anxiety for shy children. Shy children tend 
to display constrained and vigilant behaviors in social-evaluative 
settings, which may lower the benefits of opportunities to enhance 
academic achievement (Asendorpf, 2010). As shyness is no longer 
viewed positively in today’s urban China (e.g., Chen et al., 2009), shy 
children appear to have increasing difficulties getting peer support and 
social resources. Therefore, although children in the high-increasing 
category have a generally growing trend in academic achievement, 
those who are relatively shy do not enjoy as much growth as their 
nonshy peers. 

Finally, we found that internalizing problems and victimization 
experiences were significant predictors of low initial academic 
achievement for children who were in the low-stable class. The 
psychological distress that children with internalizing problems expe- 
rience may make them unable to effectively cope with their academic 
difficulties. Moreover, the symptoms they display may be viewed as 
deviant, which may impede their social affiliation and connectedness. 
As a result, these children are less likely than others to have support- 
ive peer networks to help them with school tasks when needed. The 
experience of victimization may also function as an external stressor 
that reduces children’s abilities to cope with school demands and 
leads to negative attitudes toward school (e.g., Schwartz et al., 2005). 
Taken together, children who have internalizing problems and who 
are victimized appear to be in a particularly adverse condition that 
makes their academic performance worse, even compared to other 
underachieving students without such problems. 

Our results showed that early social behaviors and problems 
predicted the initial status of academic performance in the low- 
stable class more evidently and consistently than in other classes. 
The results indicated the exacerbating effects of initially poor 
social-behavioral functioning on the academic performance of 
children in the low-stable class. Poor social-behavioral functioning 
may create an unfavorable social context, such as peer rejection 
and lack of support and assistance on academic work, which is 
likely to have a salient adverse impact on the academic perfor- 
mance of underachieving children. The high vulnerability of chil- 
dren in the low-stable class to the effects of social-behavioral 
problems may also be related to their emotional reactions such as 
frustration resulting from frequent experiences of academic diffi- 
culties (e.g., Chen et al., 1997, 2013). 


Limitations and Conclusions 


Several weaknesses and limitations in this study should be 
noted. First, this study focused on the contributions of social 
behaviors and problems to academic trajectories in Chinese chil- 
dren. Considering the emphasis on academic achievement in 
China, it will be important to examine whether academic achieve- 
ment is associated with heterogeneity and individual differences in 
the development of social behaviors (e.g., Booth-LaForce & Ox- 
ford, 2008; Oh et al., 2008). For example, it has been found that 
academic achievement protects shy Chinese children from devel- 
oping psychological difficulties (Chen et al., 2013). It will be 
interesting to investigate whether and how academic achievement 
contributes to different developmental patterns of shyness and 
other social behaviors in Chinese children. 

Second, we examined the relations between indexes of early social- 
behavioral functioning and academic trajectories mainly to under- 
stand the role of children’s social behaviors and problems in identi- 
fying distinctive academic trajectories and in predicting academic 
growth patterns within the trajectories. However, these data were 
correlational in nature so one should be careful in interpreting the 
results in terms of causality. It is possible, for example, that social- 
behavioral problems in Grade 2 resulted from poor academic perfor- 
mance in kindergarten and Grade 1. Researchers should investigate 
causal directions between social-behavioral functioning and academic 
achievement using different approaches (e.g., experimental designs, 
longitudinal cross-lagged panel analyses). 

Third, the present study focused on social behaviors as predic- 
tors of distinctive academic trajectories. The contributions of so- 
cial behaviors to academic performance likely occur in broader 
social contexts such as families, peers, and schools. It will be 
important in future research to explore how social contexts play a 
role in shaping the relations between social-behavioral functioning 
and academic trajectories in Chinese children. 

Despite the limitations, the present study made several major con- 
tributions to our understanding of children’s academic achievement. 
First, the study revealed the developmental heterogeneity of academic 
achievement in the Chinese cultural context. Previous research on 
academic achievement among Chinese children has treated them as a 
uniform group. Our findings indicate that multiple developmental 
patterns exist and that the distinct trajectories that emerged may be related 
to the influence of Chinese sociocultural circumstances. Second, our 
results indicated the significant role of early social-behavioral func- 
tioning in differentiating the academic trajectories among Chinese 
children. The results have implications for identifying children at risk 
for low academic performance and for designing programs to help 
these children through promoting their social-behavioral functioning. 
For example, researchers, professionals, and educators may consider 
an integrative approach to incorporate the enhancement of social- 
behavioral competencies into prevention and intervention programs 
for children who have academic difficulties. Third, the multiwave 
data in this study allowed us to examine the effects of social- 
behavioral factors on the initial status and growth rate within each 
academic trajectory, an analysis rarely conducted in previous studies. 
The results suggest that the link between social-behavioral function- 
ing and academic performance needs to be understood for children in 
specific academic trajectories. 


References 


Asendorpf, J. B. (1990). Beyond social withdrawal: Shyness, unsociability, 
and peer avoidance. Human Development, 33, 250-259. http://dx.doi 
.org/10.1159/000276522 

Asendorpf, J. B. (2010). Long-term development of shyness: Looking 
forward and looking backward. In K. H. Rubin & R. J. Coplan (Eds.), 
The development of shyness and social withdrawal (pp. 157-175). New 
York, NY: Guilford Press. Retrieved from http://www.guilford.com/ 
books/The-Development-of-Shyness-and-Social-Withdrawal/Rubin- 
Coplan/9781606235225 

Aunola, K., Leskinen, E., Lerkkanen, M.-K., & Nurmi, J.-E. (2004). 
Developmental dynamics of math performance from preschool to grade 
2. Journal of Educational Psychology, 96, 699-713. http://dx.doi.org/ 
10.1037/0022-0663.96.4.699 

Booth-LaForce, C., & Oxford, M. L. (2008). Trajectories of social withdrawal 
from Grades 1 to 6: Prediction from early parenting, attachment, 
and temperament. Developmental Psychology, 44, 1298-1313. http://dx 
.doi.org/10.1037/a0012954 

Caprara, G., Fida, R., Vecchione, M., Del Bove, G., Vecchio, G., Bar- 
baranelli, C., & Bandura, A. (2008). Longitudinal analysis of the role of 
perceived self-efficacy for self-regulated learning in academic continuance 
and achievement. Journal of Educational Psychology, 100, 525-534. 
http://dx.doi.org/10.1037/0022-0663.100.3.525 

Chang, L. (2004). The role of classroom norms in contextualizing the 
relations of children’s social behaviors to peer acceptance. Developmen- 
tal Psychology, 40, 691-702. http://dx.doi.org/10.1037/0012-1649.40.5 
.691 

Chang, L., Lei, L., Li, K. K., Liu, H., Guo, B., Wang, Y., & Fung, K. Y. 
(2005). Peer acceptance and self-perceptions of verbal and behavioral 
aggression and social withdrawal. International Journal of Behavioral 
Development, 29, 48-57. http://dx.doi.org/10.1080/0165025044 
4000324 

Chen, Q., Hughes, J. N., & Kwok, O.-M. (2014). Differential growth 
trajectories for achievement among children retained in first grade: A 
growth mixture model. The Elementary School Journal, 114, 327-353. 
http://dx.doi.org/10.1086/674054 

Chen, X., Cen, G., Li, D., & He, Y. (2005). Social functioning and 
adjustment in Chinese children: The imprint of historical time. Child 
Development, 76, 182-195. http://dx.doi.org/10.1111/j.1467-8624.2005 
.00838.x 

Chen, X., & French, D. C. (2008). Children’s social competence in cultural 
context. Annual Review of Psychology, 59, 591-616. http://dx.doi.org/ 
10.1146/annurev.psych.59.103006.093606 

Chen, X., Huang, X., Chang, L., Wang, L., & Li, D. (2010). Aggression, 
social competence, and academic achievement in Chinese children: A 
5-year longitudinal study. Development and Psychopathology, 22, 583-592. 
http://dx.doi.org/10.1017/S0954579410000295 


ACADEMIC ACHIEVEMENT TRAJECTORIES 


Chen, X., Rubin, K. H., & Li, D. (1997). Relation between academic 
achievement and social adjustment: Evidence from Chinese children. 
Developmental Psychology, 33, 518-525. http://dx.doi.org/10.1037/ 
0012-1649.33.3.518 

Chen, X., Rubin, K. H., & Li, Z. (1995). Social functioning and adjustment 
in Chinese children: A longitudinal study. Developmental Psychology, 
31, 531-539. http://dx.doi.org/10.1037/0012-1649.31.4.531 

Chen, X., Rubin, K. H., & Sun, Y. (1992). Social reputation and peer 
relationships in Chinese and Canadian children: A cross-cultural study. 
Child Development, 63, 1336-1343. http://dx.doi.org/10.2307/1131559 

Chen, X., Wang, L., & Cao, R. (2011). Shyness-sensitivity and unsocia- 
bility in rural Chinese children: Relations with social, school, and 
psychological adjustment. Child Development, 82, 1531-1543. http://dx 
.doi.org/10.1111/j.1467-8624.2011.01616.x 

Chen, X., Wang, L., & Wang, Z. (2009). Shyness-sensitivity and social, 
school, and psychological adjustment in rural migrant and urban children 
in China. Child Development, 80, 1499-1513. http://dx.doi.org/10.1111/ 
j.1467-8624.2009.01347.x 

Chen, X., Yang, F., & Wang, L. (2013). Relations between shyness- 
sensitivity and internalizing problems in Chinese children: Moderating 
effects of academic achievement. Journal of Abnormal Child Psychol- 
ogy, 41, 825-836. http://dx.doi.org/10.1007/s10802-012-9708-6 

Coplan, R. J., & Armer, M. (2007). A “multitude” of solitude: A closer 
look at social withdrawal and nonsocial play in early childhood. Child 
Development Perspectives, 1, 26-32. http://dx.doi.org/10.1111/j.1750- 
8606.2007.00006.x 

Coplan, R. J., & Weeks, M. (2010). Unsociability in middle childhood: 
Conceptualization, assessment, and associations with socio-emotional 
functioning. Merrill-Palmer Quarterly, 56, 105-130. 

Feldman, B. J., Masyn, K. E., & Conger, R. D. (2009). New approaches to 
studying problem behaviors: A comparison of methods for modeling 
longitudinal, categorical adolescent drinking data. Developmental Psy- 
chology, 45, 652-676. http://dx.doi.org/10.1037/a0014851 

Fuligni, A. J., Tseng, V., & Lam, M. (1999). Attitudes toward family 
obligations among American adolescents from Asian, Latin American, 
and European backgrounds. Child Development, 70, 1030-1044. http:// 
dx.doi.org/10.1111/1467-8624.00075 

Graham, J. W. (2009). Missing data analysis: Making it work in the real 
world. Annual Review of Psychology, 60, 549-576. http://dx.doi.org/10 
.1146/annurev.psych.58.110405.085530 

Hao, L., & Woo, H. S. (2012). Distinct trajectories in the transition to 
adulthood: Are children of immigrants advantaged? Child Development, 
83, 1623-1639. http://dx.doi.org/10.1111/j.1467-8624.2012.01798.x 

Hightower, A. D., Work, W. C., Cohen, E. L., Lotyczewski, B. S., Spinell, 
A. P., Guare, J. C., & Rohrbeck, C. A. (1986). The Teacher-Child Rating 
Scale: A brief objective measure of elementary children’s school prob- 
lem behaviors and competencies. School Psychology Review, 15, 393- 
409. 

Ho, D. Y. F. (1986). Chinese pattern of socialization: A critical review. In 
M. H. Bond (Ed.), The psychology of the Chinese people (pp. 1-37). 
New York, NY: Oxford University Press. Retrieved from http://psycnet. 
apa.org/PsycINFO/1987-97682-001 

Hodis, F. A., Meyer, L. H., McClure, J., Weir, K. F., & Walkey, F. H. 
(2011). A longitudinal investigation of motivation and secondary school 
achievement using growth mixture modeling. Journal of Educational 
Psychology, 103, 312-323. http://dx.doi.org/10.1037/a0022547 

Jia, Y., Way, N., Ling, G., Yoshikawa, H., Chen, X., Hughes, D., . . . Lu, 
Z. (2009). The influence of student perceptions of school climate on 
socioemotional and academic adjustment: A comparison of Chinese and 
American adolescents. Child Development, 80, 1514-1530. http://dx.doi 
.org/10.1111/j.1467-8624.2009.01348.x 

Jung, T., & Wickrama, K. A. S. (2008). An introduction to latent class 
growth analysis and growth mixture modeling. Social and Personality 
Psychology Compass, 2, 302-317. http://dx.doi.org/10.1111/j.1751- 
9004.2007.00054.x 

Krueger, R. F., Caspi, A., Moffitt, T. E., White, J., & Stouthamer-Loeber, 
M. (1996). Delay of gratification, psychopathology, and personality: Is 
low self-control specific to externalizing problems? Journal of Person- 
ality, 64, 107-129. http://dx.doi.org/10.1111/j.1467-6494.1996 
.tb00816.x 

Kupersmidt, J. B., Griesler, P. C., DeRosier, M. E., Patterson, C. J., & 
Davis, P. W. (1995). Childhood aggression and peer relations in the 
context of family and neighborhood factors. Child Development, 66, 
360-375. http://dx.doi.org/10.2307/1131583 

Ladd, G. W., & Dinella, L. M. (2009). Continuity and change in early 
school engagement: Predictive of children’s achievement trajectories 
from first to eighth grade? Journal of Educational Psychology, 101, 
190-206. http://dx.doi.org/10.1037/a0013153 

Lam, S. F., Jimerson, S., Kikas, E., Cefai, C., Veiga, F. H., Nelson, B., .. . 
Zollneritsch, J. (2012). Do girls and boys perceive themselves as equally 
engaged in school? The results of an international study from 12 coun- 
tries. Journal of School Psychology, 50, 77-94. http://dx.doi.org/10 
.1016/j.jsp.2011.07.004 

Liu, J., Bullock, A., & Coplan, R. J. (2014). Predictive relations between 
peer victimization and academic achievement in Chinese children. 
School Psychology Quarterly, 29, 89-98. http://dx.doi.org/10.1037/ 
spq0000044 

Liu, J., Coplan, R. J., Chen, X., Li, D., Ding, X., & Zhou, Y. (2014). 
Unsociability and shyness in Chinese children: Concurrent and predictive 
relations with indices of adjustment. Social Development, 23, 119-136. 
http://dx.doi.org/10.1111/sode.12034 

Liu, J., Zhou, Y., & Li, D. (2012). School adjustment and internalizing 
problems in Chinese adolescents: Implications of social change. Social 
Behavior and Personality, 40, 657-666. http://dx.doi.org/10.2224/sbp 
.2012.40.4.657 

Lubke, G., & Muthén, B. (2007). Performance of factor mixture models as 
a function of model size, covariate effects, and class-specific parameters. 
Structural Equation Modeling, 14, 26-47. http://dx.doi.org/10.1080/ 
10705510709336735 

Masten, A. S., Morison, P., & Pellegrini, D. S. (1985). A revised class play 
method of peer assessment. Developmental Psychology, 21, 523-533. 
http://dx.doi.org/10.1037/0012-1649.21.3.523 

McClelland, M. M., Acock, A. C., & Morrison, F. J. (2006). The impact of 
kindergarten learning-related skills on academic trajectories at the end of 
elementary school. Early Childhood Research Quarterly, 21, 471-490. 
http://dx.doi.org/10.1016/j.ecresq.2006.09.003 

Morison, P., & Masten, A. S. (1991). Peer reputation in middle childhood 
as a predictor of adaptation in adolescence: A seven-year follow-up. 
Child Development, 62, 991-1007. http://dx.doi.org/10.2307/1131148 

Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and 
related techniques for longitudinal data. In D. Kaplan (Ed.), Handbook 
of quantitative methodology for the social sciences (pp. 345-368). 
Newbury Park, CA: Sage. http://dx.doi.org/10.4135/9781412986311 
.n19 

Nylund, K. L., Asparouhov, T., & Muthén, B. (2007). Deciding on the 
number of classes in latent class analysis and growth mixture modeling: 
A Monte Carlo simulation study. Structural Equation Modeling, 14, 
535-569. http://dx.doi.org/10.1080/10705510701575396 

Oh, W., Rubin, K. H., Bowker, J. C., Booth-LaForce, C., Rose-Krasnor, L., 
& Laursen, B. (2008). Trajectories of social withdrawal from middle 
childhood to early adolescence. Journal of Abnormal Child Psychology, 
36, 553-566. http://dx.doi.org/10.1007/s10802-007-9199-z 

Phillipson, S., & Phillipson, S. N. (2007). Academic expectations, belief of 
ability, and involvement by parents as predictors of child achievement: 
A cross-cultural comparison. Educational Psychology, 27, 329-348. 
http://dx.doi.org/10.1080/01443410601104130 




Reinecke, J., & Seddig, D. (2011). Growth mixture models in longitudinal 
research. Advances in Statistical Analysis, 95, 415-434. http://dx.doi 
.org/10.1007/s10182-011-0171-4 

Rubin, K. H., Chen, X., McDougall, P., Bowker, A., & McKinnon, J. 
(1995). The Waterloo Longitudinal Project: Predicting internalizing and 
externalizing problems in adolescence. Development and Psychopathol- 
ogy, 7, 751-764. http://dx.doi.org/10.1017/S0954579400006829 

Schwartz, D., Chang, L., & Farver, J. M. (2001). Correlates of victimiza- 
tion in Chinese children’s peer groups. Developmental Psychology, 37, 
520-532. http://dx.doi.org/10.1037/0012-1649.37.4.520 

Schwartz, D., Gorman, A. H., Nakamoto, J., & Toblin, R. L. (2005). 
Victimization in the peer group and children’s academic functioning. 
Journal of Educational Psychology, 97, 425-435. http://dx.doi.org/10 
.1037/0022-0663.97.3.425 

Stevenson, H. W., Chen, C., & Lee, S. Y. (1993). Mathematics achieve- 
ment of Chinese, Japanese, and American children: Ten years later. 
Science, 259, 53-58. http://dx.doi.org/10.1126/science.8418494 




Wentzel, K. R. (2005). Peer relationships, motivation, and academic per- 
formance at school. In A. J. Elliot & C. S. Dweck (Eds.), Handbook of 
competence and motivation (pp. 279-296). New York, NY: Guilford 
Press. Retrieved from
http://www.guilford.com/books/Handbook-of-Competence-and-Motivation/Elliot-Dweck/9781593856069

Zhou, Q., Main, A., & Wang, Y. (2010). The relations of temperamental 
effortful control and anger/frustration to Chinese children’s academic 
achievement and social adjustment: A longitudinal study. Journal of 
Educational Psychology, 102, 180-196.
http://dx.doi.org/10.1037/a0015908


Received May 15, 2015 
Revision received October 22, 2015 
Accepted October 23, 2015


Journal of Educational Psychology 
2016, Vol. 108, No. 7, 1013-1027 


© 2016 American Psychological Association 
0022-0663/16/$12.00 http://dx.doi.org/10.1037/edu0000106 


Teachers’ Self-Efficacy in Relation to Individual Students With a Variety 
of Social-Emotional Behaviors: A Multilevel Investigation 


Marjolein Zee, Peter F. de Jong, and Helma M. Y. Koomen 


University of Amsterdam 


The present study examined teachers’ domain-specific self-efficacy (TSE) in relation to individual 
students with a variety of social-emotional behaviors in class. Using a sample of 526 third- to sixth-grade 
students and 69 teachers, multilevel modeling was conducted to examine students’ externalizing, 
internalizing, and prosocial behaviors as predictors of TSE toward individual students, and the potential 
moderating roles of teaching experience and teachers’ perceived amount of classroom misbehavior. 
Results showed that most of the variance in TSE occurred within teachers. Students’ externalizing 
behavior was negatively associated with TSE for instructional strategies, behavior management, student 
engagement, and emotional support. In contrast, teachers reported higher levels of self-efficacy toward 
students with high levels of prosocial behavior, irrespective of teaching domain. Students’ internalizing 
behavior predicted lower levels of TSE for instructional strategies and emotional support, and higher 
levels of TSE for behavior management. Last, teachers’ perceived levels of classroom misbehavior 
exacerbated the negative association between externalizing student behavior and TSE for behavior 
management. These findings illustrate the importance of viewing TSE from a dyadic perspective. 


Keywords: sources of student-specific teacher self-efficacy, internalizing, externalizing, prosocial
behavior


Challenging students bring many behaviors and qualities to the 
classroom that may seriously hamper teachers’ ability to execute 
their daily teaching tasks (Westling, 2010). Studies have indicated 
that behaviorally or emotionally disturbed students unnecessarily 
take time away from instruction, try teachers’ patience, fail to 
comply with classroom rules, and consequently, may hinder teach- 
ers’ efforts to sustain a positive learning climate (Bru, 2009; 
Clunies-Ross, Little, & Kienhuis, 2008; Putnam, Luiselli, Handler, 
& Jefferson, 2003). Undoubtedly, some teachers may experience 
little trouble nipping such behaviors in the bud. For many others, 
however, students’ challenging behavior frequently marks the be- 
ginning of a vicious cycle of stress and burnout (e.g., Brouwers & 
Tomic, 2000; Fernet, Guay, Senécal, & Austin, 2012; Friedman, 
2006), which may eventually lead these teachers to leave the 
profession entirely (Tsouloupas, Carson, Matthews, Grawitch, & 
Barber, 2010). 

Scholars have laid claim to a number of factors that potentially 
discriminate teachers who cope effectively from those who are 
commonly struggling to manage challenging behavior. Of these 
factors, teachers’ self-efficacy (TSE) beliefs, or self-referent judgments
of operative capability, are probably one of the most pervasive
(Bandura, 1997; Tschannen-Moran & Woolfolk Hoy,
2001). Past empirical evidence suggests that when educators have 
a resilient sense of self-efficacy, they are more likely to success- 
fully deal with challenging student behavior and to persist longer 
than teachers who lack such beliefs (e.g., Almog & Shechtman, 
2007; Lambert, McCarthy, O’Donnell, & Wang, 2009). On a more
theoretical note, self-efficacious teachers are also presumed to be 
steadily capable of motivating challenging students, to believe in 
their improvability, and to rely on intrinsic inducements to get 
these students to study (Bandura, 1997; Tschannen-Moran & 
Woolfolk Hoy, 2001). 

To date, the significance of self-percepts of efficacy for teachers’ 
dealings with students at the classroom level of analysis is fairly 
well-established in various teaching domains (Woolfolk Hoy, Hoy, & 
Davis, 2009). There is, however, a dearth of studies considering TSE 
toward individual students. This lack of research is disadvantageous, 
as efficacy judgments related to various teaching domains and indi- 
vidual students may more reliably predict teachers’ behaviors toward 
specific children, as well as the effort and persistence teachers put in 
teaching them (Bandura, 1997; Tschannen-Moran, Woolfolk Hoy, & 
Hoy, 1998). For a comprehensive understanding of teachers’ ability to 
manage particular students, and targeting interventions for handling a 
variety of social-emotional student behaviors, knowledge of both 
domain- and student-specific TSE may therefore be vital. To add to 
this knowledge, the present study aims to examine TSE in relation to 
individual students with a variety of social-emotional behaviors (i.e., 
externalizing, internalizing, and prosocial behavior) in the classroom. 


Conceptualization of Teachers’ Self-Efficacy 


Teachers’ self-percepts of efficacy have long been considered a 
vital cognitive resource for teachers, with clear contributions to
their performances and sense of well-being in the classroom (Klassen
& Tze, 2014; Tschannen-Moran & Woolfolk Hoy, 2001;
Woolfolk Hoy et al., 2009). When teachers generally perceive 
themselves as highly efficacious, they are more likely to use 
differentiated instructional methods, employ emotionally support- 
ive behaviors that increase students’ confidence, and adopt proac- 
tive approaches to managing student-teacher conflict (Andreou & 
Rapti, 2010; Hoy & Woolfolk, 1990; Martin & Sass, 2010; Morris- 
Rothschild & Brassard, 2006; Thoonen, Sleegers, Oort, Peetsma, 
& Geijsel, 2011; Wertheim & Leyser, 2002). Teachers with a 
robust sense of general, classroom-level self-efficacy have further- 
more been found to be more satisfied with their job and to suffer 
less from burnout symptoms than less efficacious educators (Brou- 
wers, Evers, & Tomic, 2001; Caprara, Barbaranelli, Borgogni, & 
Steca, 2003; Friedman, 2003; Klassen & Chiu, 2010; Skaalvik & 
Skaalvik, 2010). These outcomes resonate well with the
social-cognitive view that self-efficacy is a potent force in affecting the
motivational, affective, cognitive, and selective processes needed 
for desired goals to be realized (Bandura, 1986, 1997). 

Scholars have keenly been on the lookout for relevant dimen- 
sions in teachers’ sense of self-efficacy (Tschannen-Moran & 
Woolfolk Hoy, 2001). Over the years, various conceptualizations 
and measures of TSE have come onto the scene, from global TSE 
scales based on locus of control theory (Gibson & Dembo, 1984; 
Guskey, 1981; Rose & Medway, 1981) to subject-, task-, or 
domain-specific measures that consider the contextualized, multi- 
faceted nature of TSE (e.g., Brouwers & Tomic, 2000; Friedman & 
Kass, 2002; Tschannen-Moran & Johnson, 2011; Tsouloupas et 
al., 2010). Since the studies of Tschannen-Moran and colleagues 
(Tschannen-Moran & Woolfolk Hoy, 2001; Tschannen-Moran et 
al., 1998), however, the well-validated three-factor model of TSE 
for instructional strategies, classroom management, and student 
engagement has dominated the field. The domains of TSE for 
instructional strategies and student engagement mainly focus on 
aspects of instructional delivery. Generally, the instructional strat- 
egies domain attempts to capture teachers’ perceived capability in 
using various instructional methods that enable and enhance stu- 
dent learning. Teachers’ self-efficacy for student engagement is 
useful in measuring the extent to which teachers feel able to 
activate students’ interest in their schoolwork. In addition to the 
instructional aspects of teaching and learning, TSE for classroom 
management encompasses teachers’ judgments of their ability to 
organize students’ time, behavior, and attention (cf. Emmer & 
Stough, 2001). Although moderate to strong correlations among 
the three domains of TSE exist, there is empirical evidence to 
suggest that each construct assesses unique aspects of teachers’ 
sense of self-efficacy (e.g., Heneman, Kimball, & Milanowski, 
2006; Tschannen-Moran & Woolfolk Hoy, 2001). Thereby, 
Tschannen-Moran and Woolfolk Hoy’s model substantiates the 
social-cognitive premise that TSE is specific to different tasks and
domains of teachers’ functioning (Bandura, 1997; Tschannen- 
Moran et al., 1998). 

Despite general consensus on the highly context-specific nature 
of TSE, most research has been conducted at the classroom-level 
of analysis, focusing on teachers’ general beliefs of capability 
toward the class they currently teach. As such, these studies could 
be considered to be subject to the ecological fallacy (Piantadosi, 
Byar, & Green, 1988) that teachers’ self-percepts of efficacy also 
hold for individual students. Assumedly, students all bring idiosyncratic
behaviors and characteristics to the classroom that may
more or less impact teachers’ self-efficacy beliefs across different
domains of teaching and learning. Whereas obliging and hardworking
students will most likely raise teachers’ self-efficacy,
instances of misconduct may seriously undermine teachers’ 
student-specific capability beliefs. Two multilevel studies 
(Raudenbush, Rowan, & Cheong, 1992; Ross, Cousins, & Gadalla, 
1996), based on a single-item measure to evaluate TSE at the 
classroom-level, indicated that between 13% and 44% of the 
variance in TSE can be explained by such within-class variables as 
students’ grade, academic level, and interest in their schoolwork. 
In addition, empirical research and theorizing from Spilt and 
colleagues (Spilt & Koomen, 2009; Spilt, Koomen, & Thijs, 2011) 
suggested that individual students who display behavioral problems
are more likely to weaken teachers’ self-efficacy beliefs and
to evoke feelings of helplessness than students without such prob- 
lems. These findings suggest that teachers may significantly vary 
in their self-efficacy toward particular students. 


Students’ Social-Emotional Behaviors as
Predictors of TSE 


Social-cognitive theorists have generally asserted that
self-percepts of efficacy are shaped, in large part, by specific events
and experiences linked to distinct realms of functioning (Bandura, 
1997). For teachers, such experiences typically derive from au- 
thentic educational endeavors with students. Indeed, a sparse 
amount of existing research (Bandura, 1997; Tschannen-Moran & 
Woolfolk Hoy, 2007; Tschannen-Moran et al., 1998) has theorized 
that successful experiences with instructing, engaging, and man- 
aging students may significantly add to a healthy sense of TSE. In 
contrast, unsuccessful dealings with individual students, and par- 
ticularly those who display challenging behavior, have been em- 
pirically evidenced to elicit negative emotions that lead teachers to 
lose faith in their capabilities and collapse under the burden of 
everyday stress (Emmer & Stough, 2001; Spilt & Koomen, 2009; 
Spilt et al., 2011; Tsouloupas et al., 2010). Accordingly, teachers’ 
classroom experiences and subsequent feelings of self-efficacy 
may be heavily influenced by a variety of social-emotional stu- 
dent behaviors in the classroom. In line with prior research on 
students’ social-emotional adjustment (e.g., Roorda, Verschueren, 
Vancraeyveldt, Van Craeyevelt, & Colpin, 2014), we consider 
students’ externalizing, internalizing, and prosocial behaviors as 
sources of TSE toward individual students. 

Externalizing behavior. Past empirical research has repeat- 
edly pinpointed externalizing student behavior, including aggres- 
sion, hyperactivity, and antisocial behavior, to be at the core of the 
challenges most teachers face on a daily basis (Brouwers & Tomic, 
2000; Evers, Tomic, & Brouwers, 2004; Hastings & Bham, 2003; 
Kokkinos, Panayiotou, & Davazoglou, 2004, 2005; Kyriacou, 
2001; Roehrig, Pressley, & Talotta, 2002). These disruptive be- 
haviors may ripple through the entire classroom and have been 
suggested to cause elevated levels of stress and emotional exhaus- 
tion in teachers (Clunies-Ross et al., 2008; Kokkinos et al., 2004; 
Spilt & Koomen, 2009; Tsouloupas et al., 2010). Evidently, indi- 
vidual students’ externalizing behavior patterns may color teach- 
ers’ initial experiences and enduring beliefs of capability to effec- 
tively deal with them. The correlational results of Lambert and 
colleagues (2009), for instance, put forward that highly overactive
and distractible students may generally hamper US teachers’ attitude
toward their teaching abilities, and their sense of self-efficacy
in dealing with, and establishing positive relationships with chal- 
lenging students. Also focusing on US teachers’ self-efficacy for 
classroom management, Tsouloupas et al. (2010) demonstrated 
that high levels of teacher-perceived misbehavior in the classroom 
may negatively affect TSE in dealing with disruptive behavior and 
stressful situations, which, in turn, may cause them to feel emo- 
tionally exhausted. Other empirical research from Cyprus (e.g., 
Kokkinos et al., 2004, 2005) and the United States (Roehrig et al., 
2002) has indicated that behaviors of an externalizing nature, 
including conduct problems, hyperactivity, anger, and disrespect- 
fulness, generally yield the most negative impressions on teachers 
and may lead them to feel helpless and inefficacious. 

In addition to the literature linking students’ externalizing behavior
to general or domain-specific (classroom management)
TSE at the classroom-level, a modest body of primarily American 
research has also begun to explore within-person variability in 
teacher cognitions. For instance, several scholars (e.g., Abidin & 
Robinson, 2002; Greene, Abidin, & Kmetz, 1997; Greene, Besz- 
terczey, Katzenstein, Park, & Goring, 2002) have highlighted 
teachers’ cognitions and judgments of individual student behavior 
as crucial contributors to their differential treatment of particular 
students in class. In line with this assertion, Spilt and Koomen 
(2009) used Pianta’s (1999) Teacher Relationship Interview and 
associated coding system to assess strengths and difficulties in 
teachers’ beliefs and feelings in relationships with specific, dis- 
ruptive students in the Netherlands. They revealed that teachers 
perceive themselves as angrier and less self-efficacious in relation 
to individual students who display disruptive behavior in the 
classroom. These outcomes are consistent with the idea that TSE 
may be highly individualized in nature and might depend on how 
teachers appraise individual students’ disruptive, externalizing be- 
haviors. 

Notably, negative personal feelings, cognitions, and efficacy 
beliefs seem to be particularly echoed in inexperienced teachers’ 
reports of their students’ behaviors (cf. Emmer & Stough, 2001). 
Using a grounded theory approach to study US teachers’ percep- 
tions of student needs, Feuerborn and Chinn (2012) revealed that 
novice teachers may express more emotionally laden reactions in 
relation to externalizing behavior than their experienced cowork- 
ers, and seem more afflicted by the instructional disruptions these 
behaviors cause. These qualitative findings stretch across empiri- 
cal studies from Europe as well. Results from Kokkinos and 
colleagues (Kokkinos et al., 2004, 2005) suggested that more 
experienced teachers generally perceive disruptive student behav- 
ior as less challenging and more controllable in the classroom. 
From this line of evidence, it can be hypothesized that increases in 
teachers’ experience may potentially buffer the negative associa- 
tion between teacher-perceived externalizing student behavior and 
student-specific TSE. 

Internalizing behavior. Counter to externalizing behavior, 
students with symptoms of internalizing behavior, including shy- 
ness, verbal inhibition, anxiety, or social withdrawal (Coplan, 
2000; Gazelle & Ladd, 2003; Merrell, 1999), have been suggested 
to evoke less challenging experiences or negative thoughts in their 
teachers (Rubin & Coplan, 2004). These internalizing difficulties 
may be more subtle than manifestations of externalizing conduct 
and usually tend to reflect more appropriate classroom behavior
and decorum (e.g., Coplan, 2000; Gresham & Kern, 2004; Kokkinos
et al., 2004; Rubin & Coplan, 2004). As such, internalizers
are more likely to go undetected or ignored by their teachers than 
students with externalizing conduct (Coplan & Prakash, 2003) and 
may have little, if any, influence on teachers’ self-efficacy judg- 
ment toward them in different teaching domains. 

Yet, there might be some reason to believe that behaviors of a 
more internalizing nature may still be bothersome to the teacher 
and contribute to their self-percepts of efficacy (e.g., Olson & 
Cooper, 2001; Westling, 2010). Notably, the one empirical study 
to examine US teachers’ self-efficacy at the classroom-level in 
relation to internalizing student behavior indicated that highly 
self-efficacious teachers may be more bothered by students’ inter- 
nalizing behavior than those who are less confident in their per- 
sonal teaching effectiveness (Liljequist & Renk, 2007). One of the 
scenarios that may account for this finding is that a healthy sense 
of TSE frequently coincides with increases in teaching experience 
(e.g., Klassen & Chiu, 2010). Empirical studies of Kokkinos and 
colleagues (2004, 2005) pointed out that this growth in experience 
is essential for gaining knowledge of, and becoming sensitized to 
internalizers’ more subtle behavioral and affective cues. Without 
such vital knowledge and experience, teachers may feel less wor- 
ried about and less responsible for students’ internalizing behavior 
patterns, and thereby, less hindered in their self-efficacy to deal 
with them (cf. Liljequist & Renk, 2007). In contrast, when teachers 
consciously experience that their instructional initiatives are un- 
successful in establishing reciprocal interchanges with a student 
who displays internalizing behavior, a lowered sense of TSE 
toward this child is likely to arise. Hence, counter to the protective 
effect of teaching experience on the negative association between 
externalizing behavior and TSE, increases in teaching experience
might serve as an additional risk factor for teachers’ self-efficacy 
toward students with internalizing symptoms. Unless teachers be- 
lieve they can gather up the resources to successfully deal with 
individual students with internalizing symptoms, they will proba- 
bly dwell on their actions, exercise inadequate effort, and may 
consequently experience failure. 

Prosocial behavior. Most of the previous work on teacher 
self-efficacy has predominantly attempted to study challenging 
student behavior as antecedents of these capability beliefs (e.g., 
Lambert et al., 2009; Liljequist & Renk, 2007; Tsouloupas et al., 
2010). It is likely, however, that students’ propensity to act proso- 
cially may also contribute to teachers’ self-efficaciousness toward 
individual children, but in a more favorable sense. Generally, 
prosocial behaviors are implicated with various voluntary acts 
intended to benefit others, including helping, sharing, comforting, 
and cooperating (Dunfield & Kuhlmeier, 2013; Dunfield, 
Kuhlmeier, O’Connell, & Kelley, 2011; Eisenberg, 1982). Such 
prosocial tendencies have frequently been linked to key classroom 
outcomes such as academic achievement (e.g., Caprara, Bar- 
baranelli, Pastorelli, Bandura, & Zimbardo, 2000; Malecki & 
Elliot, 2002; Wentzel, 1993), engagement (Coolahan, Fantuzzo, 
Mendez, & McDermott, 2000), and the quality of students’ rela- 
tionships with teachers and peers (Birch & Ladd, 1998; Henricsson 
& Rydell, 2004; Zimmer-Gembeck, Geiger, & Crick, 2005). As- 
sumedly, these agreeable behaviors and performances may provide 
teachers with the classroom mastery experiences that reinforce a 
healthy sense of self-efficacy (Goddard & Goddard, 2001; Goddard,
Hoy, & Woolfolk Hoy, 2004). Therefore, teachers may feel
more self-efficacious when dealing with students who generally
display prosocial behavior in the classroom, irrespective of teach- 
ers’ domain of functioning. 

Teachers’ perceived amount of misbehavior in the 
classroom. A number of empirical investigations from the 
United States have demonstrated that classrooms with many ag- 
gressive students may have a negative impact on the behaviors of 
their individual members. For instance, Werthamer-Larsson, Kellam,
and Wheeler (1991) found that regular students from poorly be- 
having classrooms were more often perceived as shy by their 
teacher, which can be perceived as an aspect of internalizing 
behavior (e.g., Letcher, Smart, Sanson, & Toumbourou, 2009). 
Several longitudinal studies have also indicated that students who 
are enrolled in classrooms with many aggressive students are 
likely to gradually become more aggressive themselves (e.g., Kel- 
lam, Ling, Merisca, Brown, & Ialongo, 1998; Thomas, Bierman, & 
The Conduct Problems Prevention Research Group, 2006; Thorn- 
berry & Krohn, 1997). Evidently, such trends may place an addi- 
tional burden on teachers’ ability to control these students’ behav- 
iors, and to maintain positive relationships with them (Brophy, 
1996; Doumen et al., 2008; Roorda et al., 2014). Hence, as 
classmates may contribute to escalating trends in students’ chal- 
lenging behaviors, teachers’ perceived negative classroom dynam- 
ics may be hypothesized to exacerbate the relationship between 
individual students’ externalizing or internalizing behavior and 
TSE. 


Present Study 


The present study aimed to extend the current literature by 
exploring a variety of social-emotional behaviors as predictors of 
teachers’ domain- and student-specific self-efficacy beliefs. Al- 
though the consequences of classroom-level TSE for teachers’ 
dealings with student behavior have been fairly well established 
(Woolfolk Hoy et al., 2009), empirical work on TSE seems to have 
stopped short of considering how students’ social-emotional be- 
haviors are associated with TSE across various teaching domains 
(e.g., instructional strategies or classroom management) and to- 
ward individual students (cf. Klassen, Tze, Betts, & Gordon, 
2011). Moreover, the handful of studies (e.g., Lambert et al., 2009; 
Spilt & Koomen, 2009; Tsouloupas et al., 2010) that have specif- 
ically looked into these effects tend to focus solely on patterns of 
externalizing behavior, thereby largely neglecting internalizing 
and prosocial behaviors as correlates of TSE. Building an under- 
standing of how teachers’ sense of self-efficacy is shaped by 
individual students’ various behaviors in different domains of 
teaching and learning may provide a vital foundation for interven- 
tions targeted to teachers’ dealings with challenging students. 

Based on the body of evidence on teachers’ classroom-level 
self-efficacy, several hypotheses were formulated. First, we ex- 
pected teachers to report lower levels of self-efficacy toward 
individual students with externalizing and internalizing problems, 
and higher levels of self-efficacy toward students who display 
prosocial behavior, irrespective of teachers’ domain of function- 
ing. Given the more subtle nature of students’ internalizing behav- 
ior, we expected the link between this student behavior and 
student-specific TSE across domains of teaching and learning to be 
weaker than the associations between students’ externalizing and 
prosocial behavior and student-specific TSE. Second, we hypothesized
that relatively high levels of teachers’ perceived classroom
misbehavior and a lack of teacher experience may further worsen 
the negative association of individual students’ externalizing and 
internalizing behavior with student-specific TSE. 


Method 


Participants 


Data for the current study were collected from 69 regular Dutch 
elementary school teachers and 526 third- to sixth-grade students. 
The schools from which the sample was drawn were recruited via 
telephone and e-mail, after ethical approval was granted by the 
Ethics Review Board of the Faculty of Social and Behavioral 
Sciences, University of Amsterdam (project no. 2013-CDE-3188). 
Of the 350 schools that were initially invited, 24 (6.9%) from both 
rural and urban areas across the Netherlands ultimately agreed to 
take part in this study. Nonparticipation was mainly due to the 
school’s already full agenda, or their involvement in other research 
studies. 

Participating teachers (72.6% females) had a mean age of 41.42 
years (SD = 12.34, range = 23 to 63 years). The professional 
teaching experience of these educators in primary education 
ranged from 1.5 to 44 years, with a mean of 16.67 years (SD = 
11.87). Four teachers did not provide complete demographic in- 
formation. For the student sample, eight students (four boys and 
four girls) were randomly selected from the pool of students from 
each teacher’s classroom whose parents had initially provided 
informed consent. These students were distributed across Grades 3 
(n = 54), 4 (n = 157), 5 (n = 165), and 6 (n = 150), respectively. 
At recruitment, the sampled children ranged from 7.71 to 13.04 
years of age (M = 10.57, SD = 1.11), and the gender composition 
was evenly distributed with 263 boys (50.0%) and 263 girls 
(50.0%). Based on students’ self-reports, the study sample ap- 
peared to be 85.2% Dutch, and 12.3% non-Dutch. In 2.5% of the 
cases, students failed to provide information regarding their eth- 
nicity. Based on employment statistics and parents’ education, 
most students could be considered to have an average to high 
socioeconomic status. Teachers reported both parents of partici- 
pating students to be employed in 76.8% of the families. In 20.4% 
of the cases, at least one parent appeared to be employed, and only 
2.5% of the families included two unemployed parents. In addi- 
tion, teachers indicated the majority of the parents to have finished 
senior vocational education (49.0%) or higher education (46.2%), 
leaving less than 5% of the parents to only have finished primary 
education. 


Instruments 


Students’ social-emotional behaviors. Teachers were asked 
to complete the Dutch version of the Strengths and Difficulties 
Questionnaire (SDQ; van Widenfelt, Goedhart, Treffers, & Goodman,
2003) to evaluate a variety of students’ social-emotional
behaviors. The SDQ is a brief 25-item behavioral screening ques- 
tionnaire that measures students’ adjustment and psychopathology 
in the classroom. The scale originally consists of positive and 
negative student attributes that together represent five factors re- 
flecting strengths (Prosocial Behavior) and difficulties (Emotional 
Symptoms, Conduct Problems, Hyperactivity-Inattention, and
Peer Problems). In the present study, however, use was made of
the more general Internalizing, Externalizing, and Prosocial Be- 
havior subscales, which generally are preferred over the original 
SDQ factors in low-risk samples (Goodman, Lamping, & Ploubi- 
dis, 2010). The Externalizing Behavior dimension (10 items) com- 
bines the subscales of Hyperactivity-Inattention and Conduct 
Problems, with items such as “Restless, hyperactive, cannot sit still 
for long” and “Often has temper tantrums or hot tempers.” Addi- 
tionally, the Internalizing Behavior subscale (8 items) comprises 
all items from the Emotional Symptoms factor, and three items 
from the Peer Problems factor (i.e., “Rather solitary, tends to play 
alone”, “Gets on better with adults than with other children” and 
“Picked on or bullied by other children”). Last, the 7-item Prosocial
Behavior scale reflects all five items from the Prosocial scale,
and two items from the Peer Problems scale (i.e., “Generally liked 
by other children” and “Has at least one good friend”). Teachers 
responded to the 25 items on a 5-point Likert scale, ranging from
1 (not true) to 5 (certainly true). 
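
As an illustration of how the three broader subscales described above could be assembled from the 25 items, the following Python sketch groups the items and computes one score per subscale. Several details are assumptions rather than facts from the text: the item numbers follow the standard SDQ item order (the text gives only a few item numbers and wordings), items 7, 21, and 25 are treated as reverse-keyed, and subscale scores are taken as item means rather than sums. The name `score_sdq` is a hypothetical helper, not part of any published scoring tool.

```python
from statistics import mean

# Assumed item numbering: standard 25-item SDQ order (not stated in the text).
HYPERACTIVITY = [2, 10, 15, 21, 25]
CONDUCT = [5, 7, 12, 18, 22]
EMOTIONAL = [3, 8, 13, 16, 24]
PEER_INTERNALIZING = [6, 19, 23]  # solitary, bullied, gets on better with adults
PROSOCIAL = [1, 4, 9, 17, 20]
PEER_PROSOCIAL = [11, 14]         # has a good friend, generally liked

# Three-subscale composition as described in the text (10, 8, and 7 items).
SUBSCALES = {
    "externalizing": HYPERACTIVITY + CONDUCT,
    "internalizing": EMOTIONAL + PEER_INTERNALIZING,
    "prosocial": PROSOCIAL + PEER_PROSOCIAL,
}

# Assumption: positively worded items within the externalizing scale are reversed.
REVERSED = {7, 21, 25}
SCALE_MIN, SCALE_MAX = 1, 5  # 5-point scale: 1 (not true) to 5 (certainly true)


def score_sdq(responses):
    """Return a mean score per subscale for one student.

    `responses` maps item number (1-25) to the teacher's rating (1-5).
    """
    scores = {}
    for name, items in SUBSCALES.items():
        values = [
            (SCALE_MIN + SCALE_MAX - responses[i]) if i in REVERSED else responses[i]
            for i in items
        ]
        scores[name] = mean(values)
    return scores


# Hypothetical ratings for one student: every item rated at the scale midpoint.
example = {item: 3 for item in range(1, 26)}
print(score_sdq(example))
```

Using mean scores keeps the three subscales on the same 1 to 5 metric despite their different lengths, which is one reasonable choice when subscales are compared side by side.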

The psychometric properties of the three-factor SDQ model 
have been demonstrated to be especially suited for use in nonrisk 
samples (Dickey & Blumberg, 2004; Goodman et al., 2010; van 
Leeuwen, Meerschaert, Bosmans, de Medts, & Braet, 2006). To 
evaluate whether the SDQ’s three-factor solution also held in the 
present study, we performed a confirmatory factor analysis (CFA), 
using maximum likelihood estimation with robust standard errors 
and a mean-adjusted chi-square test statistic (MLR; Muthén & 
Muthén, 1998-2012). Guided by the residual covariance matrix 
and modification indices, we added four theoretically plausible 
correlated residuals to the baseline model. Two of those correlated 
residuals were indicative of aspects of students’ externalizing 
behavior. Specifically, the residuals of items 2 and 10 both re- 
flected students’ hyperactivity, and the residuals of items 15 and 
25 primarily evaluated students’ attention span. Also correlated 
were the residuals of prosocial items 9 and 20, which indicated 
students’ willingness to help others. Last, the residuals of inter- 
nalizing items 16 and 24 were allowed to correlate, as they were 
both symptomatic of students’ nervousness and anxiety. 

Despite a relatively low comparative fit index (CFI), this revised 
model yielded an acceptable fit according to established cutoff 
values of .08 for the root-mean-square error of approximation 
(RMSEA) and standardized root-mean-square residual (SRMR; 
Browne & Cudeck, 1993; Hu & Bentler, 1999; Kline, 2011), 
χ²(268) = 890.04, p < .001, RMSEA = .067 (90% confidence
interval [CI] [.062, .072]), CFI = .84, SRMR = .074. These fit 
indices are consistent with previous research (Goodman et al., 
2010; van Leeuwen et al., 2006), reporting acceptable RMSEA and 
SRMR values for the three-factor solution, but CFIs below the 
conventional threshold of .90 for satisfactory fit (e.g., Bentler, 
1990, 1992; Little, 2013). Recommendations for cutoff values for 
various fit indices have previously been called into question, 
however, given that the mean value and the distribution of most fit
indices are likely to change with sample size, the distribution of the 
data, and the chosen test statistic (e.g., Yuan, 2005). The factor 
loadings of the SDQ subscales in the present study were adequate, 
ranging from .42 to .73 for Externalizing Behavior, from .41 to .80 
for Internalizing Behavior, and from .50 to .82 for Prosocial 
Behavior, respectively. Cronbach’s alphas were .81 for Internaliz- 
ing Behavior, .87 for Externalizing Behavior, and .86 for Prosocial 
Behavior, respectively. 
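
The degrees of freedom of the reported chi-square can be reconstructed from the model description, and a conventional RMSEA can be computed from the reported chi-square value. The sketch below rests on assumptions not spelled out in the text: each of the 25 items loads on exactly one of the three factors, factor variances are fixed for identification, and N = 526 students is used without any adjustment for the nesting of students within teachers. The function names are illustrative. Because the reported statistics come from the robust MLR estimator, the conventional formula only approximates the reported RMSEA of .067.

```python
import math


def cfa_degrees_of_freedom(n_items, n_factors, n_correlated_residuals):
    """Degrees of freedom for a simple-structure CFA of the covariance matrix."""
    moments = n_items * (n_items + 1) // 2       # unique variances and covariances
    loadings = n_items                           # one loading per item
    residual_variances = n_items
    factor_covariances = n_factors * (n_factors - 1) // 2
    free_parameters = (
        loadings + residual_variances + factor_covariances + n_correlated_residuals
    )
    return moments - free_parameters


def rmsea(chi_square, df, n):
    """Conventional (non-robust) RMSEA point estimate."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


df = cfa_degrees_of_freedom(n_items=25, n_factors=3, n_correlated_residuals=4)
print(df)                                 # 268, matching the reported df
print(round(rmsea(890.04, df, 526), 3))   # 0.066, close to the reported .067
```

That the parameter count reproduces the reported 268 degrees of freedom is consistent with a model containing exactly the four correlated residuals described above.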

Classroom misbehavior. A short, three-item scale developed
by Tsouloupas et al. (2010) was used to measure teachers’ per- 
ceived amount of student behavior problems in their classroom. 
Items that made up this instrument included “How frequently do 
you experience negative interactions with students?”, “How often 
do you deal with student discipline problems?” and “On average, 
how emotionally intense are your dealings with student discipline 
problems?” All items were scored on a 5-point Likert-type scale, 
ranging from 1 (almost never occurs) to 5 (occurs very frequently).
In the present sample, Cronbach’s alpha for this measure was .83. 
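
As an aside on how such internal consistency coefficients are typically obtained, the following minimal sketch computes Cronbach's alpha from raw item scores using the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of the summed score). The function name and the ratings are hypothetical and serve illustration only; they are not the study's data.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings of five teachers on three 5-point items (not study data).
ratings = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
print(round(cronbach_alpha(ratings), 2))
```

Alpha approaches 1 when the items rise and fall together across respondents, so that most of the variance of the summed score reflects shared rather than item-specific variation.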

Domain- and student-specific teacher self-efficacy. Teachers’ 
perceptions of their self-efficacy toward individual students across 
various teaching domains were estimated using the Student- 
Specific Teacher Self-Efficacy Scale (Zee & Koomen, 2015). This 
instrument, which is adapted from the Teachers’ Sense of Efficacy 
Scale (TSES; Tschannen-Moran & Woolfolk Hoy, 2001), is spe- 
cifically designed to evaluate teachers’ student-specific capability 
beliefs across various domains of teaching and learning. Largely 
similar to the original TSES, this instrument represents the three 
domains of Instructional Strategies (IS; 6 items), Behavior Man- 
agement (BM; 5 items), and Student Engagement (SE; 6 items). 
The domain of IS measures the extent to which teachers feel able 
to use various instructional methods that enable and enhance 
individual students’ learning, with items such as “How well can 
you respond to difficult questions from this student?” Slightly 
different from the original Classroom Management dimension is 
the BM domain, which no longer taps aspects of classroom orga- 
nization, but rather concentrates on teachers’ perceptions of their 
ability to organize and guide the behaviors of a particular student. 
A sample item of this subscale includes “How much can you do to 
get this child to follow classroom rules?” Teachers’ self-efficacy 
for SE captures teachers’ perceived ability to activate the interest 
of a particular student in his or her schoolwork. This domain of 
TSE includes items such as “How much can you do to get this 
student to believe he/she can do well in schoolwork?” 

In addition to the three broad domains proposed by Tschannen-Moran
and Woolfolk Hoy (2001), the student-specific TSES is also tar- 
geted to the domain of Emotional Support (ES; 7 items). This 
additional domain involves tasks and responsibilities related to 
how well teachers can establish caring relationships with students, 
acknowledge students’ opinions and feelings, and create settings in 
which students feel free to explore and learn. One example item of 
this subscale includes “How well can you establish a safe and 
secure environment for this student?” 

All items that made up this measure were rated by teachers on 
a seven-point Likert-type scale, ranging from 1 (nothing) to 7 (a 
great deal). A CFA using MLR (Muthén & Muthén, 1998-2012) 
provided sufficient fit to the present study’s data, after adding 
correlations between the residuals of items 13 and 14, and 19 and 
20, χ²(244) = 810.36, p < .001, RMSEA = .067 (90% CI [.062,
.072]), CFI = .91, SRMR = .073. Both correlated residuals 
seemed theoretically plausible. Specifically, SE Items 13 and 14 
focused on teachers’ perceived capability to motivate individual 
students for their schoolwork. Items 19 and 20, in addition, con- 
centrated on the extent to which the teacher felt capable of re- 
sponding positively and sincerely to a particular student. All stan- 
dardized factor loadings were considered high in this model 
(>.55), thereby supporting the factorial validity of the student- 
specific TSES. Internal consistency scores of the student-specific 
TSES domains were .89 for IS, .94 for BM, .90 for SE, and .85 for
ES, respectively. 


Procedure 


During recruitment, either school principals or participating 
teachers distributed information letters and consent forms to par- 
ents of all students from teachers’ classrooms. On average, paren- 
tal consent rates per classroom ranged between 46% and 100%. 
From all consents received, we randomly selected eight students 
from participating teachers’ classrooms and subsequently let these 
teachers know which eight students to report on. Students were 
asked to fill out several questions about their background charac- 
teristics, including students’ age, gender, and ethnicity, during a 
planned school visit. Teacher-reported questionnaires assessing 
students’ social-emotional behavior at school and teachers’ self- 
efficacy in relation to individual students were collected via an 
individually addressed digital survey link that was distributed by 
e-mail. Teachers filled out these questionnaires for each of the 
eight selected students from their classroom. Participating educa- 
tors additionally reported on some general questions regarding 
their background characteristics. The total survey took approxi- 
mately one hour to complete. Teachers were asked to return the 
digital survey within two weeks after the survey link was sent. To 
improve the participation rate, reminders were sent to nonrespond- 
ing teachers, resulting in a total response rate of 93.9%. Nonpar- 
ticipation was due to long-term sickness absence or teachers’ busy 
schedule. After participation, all teachers received a gift voucher 
of €20.00.


Data Analysis 


To examine the contribution of teachers’ and students’ back- 
ground characteristics and a variety of student behaviors in pre- 
dicting teachers’ sense of self-efficacy toward individual students, 
we fitted a series of multivariate hierarchical linear models using 
Mplus 7.11 (Muthén & Muthén, 1998-2012). This analytical tech- 
nique is quite flexible in that it corrects for nested data structures, 
and avoids aggregation bias and underestimation of standard errors 
that sometimes compromise the outcomes of Ordinary Least 
Squares-analyses of multilevel data (Snijders & Bosker, 1999). All 
fixed and random effects parameters in these models were based 
on maximum likelihood estimation with robust standard errors and 
a mean-adjusted chi-square test statistic (MLR). Predictors were 
centered around the grand mean to ease their interpretation. 
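
To illustrate what grand-mean centering does, the sketch below subtracts the overall sample mean from every score, so that a value of zero corresponds to the average student or teacher in the sample. The function name and the scores are hypothetical.

```python
def grand_mean_center(values):
    """Subtract the overall (grand) mean from every observation."""
    grand_mean = sum(values) / len(values)
    return [v - grand_mean for v in values]


# Hypothetical Externalizing Behavior scale scores (not study data).
externalizing = [1.0, 2.0, 2.0, 3.0]
centered = grand_mean_center(externalizing)
print(centered)
```

With centered predictors, the model intercept can be read as the expected outcome for a case with average scores on all predictors.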

Scale scores, represented by teachers’ mean response to relevant 
items, were used to reflect the main constructs of interest. Several 
empirical sources (e.g., Allen & Seaman, 2007; Kislenko & 
Grevholm, 2008; Leung, 2011; Parker, McDaniel, & Crumpton- 
Young, 2002) have indicated that scale scores may be treated as 
interval-level measures as long as the psychometric properties of 
the scale are sufficient. Generally, such scale scores have been 
shown to be largely insensitive to the violation of the interval 
assumption at the item-level (e.g., Leung, 2011; Parker et al., 
2002). 

In accordance with the methods proposed by Raudenbush and 
Bryk (2002), we adopted a stepwise sequential modeling strategy, 
reflecting an increasing complexity with each successive model. In 
the first step, we estimated an unconditional means model without
predictors to partition the variance of teachers’ student-specific
self-efficacy at the within-teacher and between-teachers level. This 
preliminary model was used as a baseline for subsequent model
comparisons. In the second step, we added students’ background 
characteristics, and their Externalizing, Internalizing, and Proso- 
cial Behaviors as within-level (fixed) effects of teachers’ student- 
specific Self-Efficacy. After these individual student characteris- 
tics were accounted for, we added between-teachers covariates to 
the equation to explain variance at the between-teachers level. 
Last, to examine the existence of cross-level interactions of stu- 
dents’ behaviors and Teaching Experience with teachers’ per- 
ceived Classroom Misbehavior, we allowed potential random 
slopes to vary across teachers. If a particular association between 
students’ behaviors and teachers’ student-specific self-efficacy 
significantly varied across teachers, cross-level interactions were
added. 


Results 


Descriptive Statistics 


Table 1 presents descriptive statistics, including zero-order cor- 
relations, means, and standard deviations of the variables. Consis- 
tent with expectations, moderate to strong negative correlations 
were found between students’ Externalizing Behavior and dimen- 
sions of teachers’ Student-Specific Self-Efficacy. Notably, the 
association between Externalizing Behavior and TSE for BM 
appeared to be the strongest, suggesting that teachers felt the least 
confident in dealing with disruptive students in the domain of 
Behavior Management. Somewhat smaller negative correlations 
were found between students’ Internalizing Behavior and teachers’ 
Student-Specific self-percepts of Efficacy. These behaviors 
seemed to have a slightly higher association with teachers’ belief 
in their capability to provide individual students with adequate 
emotional support and security. The positive correlations between 
students’ Prosocial Behavior and TSE in relation to individual 
students were also in line with hypotheses. Teachers who generally 
perceived their students to act prosocially in the classroom seemed 
to experience higher levels of Self-Efficacy toward these students 
in all domains of teaching and learning. Teachers’ perceptions of 
the amount of misbehavior in the classroom were not associated 
with any of the domains of Student-Specific TSE. It is interesting 
to note, though, that teachers who reported a large amount of 
student misbehavior in the classroom did not appear to judge the 
externalizing behaviors of individual students to be higher than 
those who reported a smaller amount of classroom misbehavior. In 
contrast, a negative association was noted between teachers’ per- 
ceived Classroom Misbehavior and individual students’ Internal- 
izing Behavior. 

Last, the correlations among students’ and teachers’ background 
characteristics, students’ behaviors, and Student-Specific TSE re- 
vealed, first, that male teachers and more experienced educators 
generally reported their students to display higher levels of Inter- 
nalizing Behavior. Teaching Experience also seemed to be posi- 
tively linked to all domains of Student-Specific TSE, indicating 
that more experienced teachers perceive themselves as more effi- 
cacious than their less experienced counterparts. In addition, teach- 
ers were likely to report higher levels of Externalizing Behavior 
and lower levels of Prosocial Behavior for boys and older students, 
Table 1


Descriptive Statistics and Correlations


Variable 1 2 3 4 5 6 7 8 9 10 11 12 


1. Teacher gender — 

2. Teacher experience = — 

3. Student gender .03 (08) — 

4. Student age .05 —.08 elias — 

5. Externalizing behavior —.08 —.03 — 26, al0- a 

6. Internalizing behavior Stor mgs 03 .08 am oe 

7. Prosocial behavior .02 .03 Olam S12) Some 4 ae _ 

8. Classroom behavior problems .08 e202 .03 a2 SoU, i 19 .06 — 

9. Student-specific TSE for IS —.04 eye Sma S emer 4 Oi ode 45** 07 == 

10. Student-specific TSE for BM .00 ae DEES ae an i 59** 02 SOs a 
11. Student-specific TSE for SE —.04 cS ae IS gg Slice SSS Sil 54 .08 88" os _— 
12. Student-specific TSE for ES .02 Ge Ae 0 sae One .08 .80"* 1655. 847 
M 16.67 10.57 1.96 2.03 4.07 2.48 5.53 6.14 5.60 5.82 
Note. Gender: 0 = boys/male teachers, 1 = girls/female teachers. TSE = teachers’ self-efficacy; IS = instructional strategies; BM = behavior
management; SE = student engagement; ES = emotional support. 
and felt the least efficacious when dealing with these particular
students. Last, it is interesting to note that students’ Internalizing 
and Externalizing Behavior were moderately correlated with each 
other, potentially suggesting comorbidity between behaviors in the 
externalizing and internalizing spectrum (cf. Keiley, Lofthouse,
Bates, Dodge, & Pettit, 2003). In the present study, the focus was 
on the unique associations between students’ social-emotional 
behaviors and teachers’ self-efficacy beliefs across domains and 
individual students. 


Unconditional Means Model 


In the first step of the analyses, we fitted an unconditional means 
model, only containing the four outcome variables (teachers’ 
Student-Specific Self-Efficacy for IS, BM, SE, and ES), and no 
predictors other than the intercept. Intraclass correlations in this 
model indicated that 14.8% to 30.7% of the variance in teachers’ 
self-efficacy toward individual students occurred between teach- 
ers. Generally, less than 5% of the variance in the domains of 
Student-Specific TSE, however, was found to be associated with 
the school-level of hierarchy, implying that teachers’ Student- 
Specific capability beliefs did not vary much across schools. Given 
the substantial variance accounted for at the within-teacher and 
between-teachers level, it can be concluded that the data require a 
model that addresses the nesting of students within teachers. 
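
The intraclass correlation reported for an unconditional means model is the share of total variance located at the higher level. The sketch below shows this calculation for a two-level partition; the function name and the variance components are hypothetical, as the baseline variance estimates themselves are not listed in the text.

```python
def intraclass_correlation(between_var, within_var):
    """Proportion of total variance that lies between clusters (here, teachers)."""
    return between_var / (between_var + within_var)


# Hypothetical variance components (not the study's estimates).
icc = intraclass_correlation(between_var=0.30, within_var=0.70)
print(round(icc, 3))
```

An intraclass correlation well above zero signals that ratings from the same teacher resemble one another, which is the condition under which single-level regression tends to underestimate standard errors.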


Student Predictors of Teachers’ Student-Specific 
Self-Efficacy 


Fixed effects of students’ background characteristics (Age and
Gender) and behaviors (Internalizing, Externalizing, and Prosocial
Behavior) were modeled to allow the identification of variables
that were uniquely related to variation among dimensions of
Student-Specific TSE. This first model (see Table 2) significantly
improved the prediction of teachers’ Student-Specific Self-Efficacy
beliefs, TRd(6) = 826.82, p < .001. Assessment of
unstandardized coefficients pointed to statistically significant negative
associations between students’ Externalizing Behavior and
teachers’ Student-Specific Self-Efficacy for IS (B = −.38, p <
.001), BM (B = −.73, p < .01), SE (B = −.55, p < .001), and ES
(B = −.27, p < .001). This indicates that with each scale point
higher on students’ Externalizing Behavior, teachers’ Student-
Specific Self-Efficacy across domains is expected to decrease
by between .27 and .73 scale points (Hox, 2002). In addition,
students’ Internalizing Behavior was only uniquely and positively 
associated with Student-Specific TSE for BM (B = .13, p < .001), 
and negatively associated with Student-Specific TSE for ES (B = 
—.08, p < .05). After accounting for Externalizing and Internaliz- 
ing Behaviors, students’ Prosocial Behavior yielded statistically 
significant positive results for all dimensions of Student-Specific 
TSE (IS: B = .28, p < .001; BM: B = .40, p < .001, SE: B = .34, 
p < .001; ES: B = .41, p < .001). Regarding students’ background 
characteristics, only students’ Age appeared to be negatively as- 
sociated with Student-Specific TSE for SE (B = —.11, p < .01) and 
ES (B = -.06, p < .05), indicating that teachers generally feel less 
self-efficacious in providing emotional support and promoting 
students’ engagement when dealing with older students. 


Teacher Predictors of Teachers’ Student-Specific 
Self-Efficacy 


After the effects of students’ background characteristics and 
behaviors were accounted for at the within-teacher level, we sub- 
sequently added teachers’ Gender, Teaching Experience, and per- 
ceived Classroom Misbehavior to the model to explain variance at 
the between-teachers level. Table 2 presents the results of these 
fixed and random effects of the analysis (Model 2). Compared to 
Model 1, we generally found no significant changes in the vari- 
ables at the within-teacher level. After addition of the teacher 
variables, however, the association between students’ Internalizing 
Problems and Student-Specific TSE for IS became statistically 
significant in Model 2 (B = -.13, p < .01), suggesting that 
teachers’ appraisals of students’ Internalizing Behavior may be 
affected by features inherent to the teacher. Yet, the significant 
link between students’ Age and TSE for ES failed to reach the 
significance threshold in this second model. 

Regarding the teacher-level variables, statistically significant
associations were noted only between Teacher Experience and
Table 2
Fixed and Random Estimates for Predictors of Teachers’ Domain- and Student-Specific Self-Efficacy

Student-Specific TSE for IS    Student-Specific TSE for BM    Student-Specific TSE for SE    Student-Specific TSE for ES
M1 M2 M1 M2 M1 M2 M1 M2
Predictor B (SE) B (SE) B (SE) B (SE) B (SE) B (SE) B (SE) B (SE)

Fixed parameters 

Intercept 5909) ce OG S)ee 613°605)™ ~ GS CU) S.67GOT) ie eS eiSiGlL) sien 5:80.06) see. 6 5 COS)am 
Student-level variables 

Student gender —.04 (.04) —.02 (.04) .03 (.03) .03 (.03) —.06 (.04) —.05 (.04) .0S (.06) .07 (.04) 

Student age —.13 (.08) —.12 (.09) —.04 (.04) —.02 (.03) —LSC0)) plo G06)h = eh COG), 209) 06) 

Externalizing behavior =23 1/C06)i— 1:43:06) 263604)” == 265105) 55245105) 53105) 30 C 06) nee SSCs 

Internalizing behavior —.06 (.05) —131@05S)a .10 (.04)™* .09 (.04)* .O1 (.04) —.07 (.04) =109'C04)> i —=116;C04)ie 

Prosocial behavior .24 (.06)™* .19 (.06)** 32€05)i 21 (05), ~ COO) a .17 (.06)** 42 (.05)** .38 (.06)** 
Teacher-level variables 

Teacher gender —.20 (.14) —.05 (.18) a(S) —.08 (.13) 

Teacher experience .29 (.15) PL TRG) 43 (.14)™* BIMCIS)) 

Classroom misbehavior .17 (.14) —.16 (.14) 16 (.14) .03 (.14) 
Random parameters 

Between-teachers variance .86 (.09)** 91 (.08)** EW @iiye .84 (.10)™* 

Within-teacher variance .67 (.04)** .60 (.04)"* -391¢03) > .35 (.03)*™* .61 (.04)™ .54 (.04)™* 51 (.04)™* 44 (.03)™* 
R² statistics

R² within .33 .40 .65 .65 .39 .46 .49 .56

R² between .14 .09 .25 .16
Note. Gender: 0 = boys/male teachers, 1 = girls/female teachers. TSE = teachers’ self-efficacy; IS = instructional strategies; BM = behavior 
management; SE = student engagement; ES = emotional support. 
*p < .05. **p < .01.


teachers’ sense of Student-Specific Self-Efficacy for SE (B = .01, 
p < .01) and ES (B = .01, p < .05). The relationships of teachers’ 
Gender and perceived Classroom Misbehavior with the dimen- 
sions of Self-Efficacy toward particular students were not statisti- 
cally significant. Overall, student variables accounted for 40% of 
the within-teacher variance in Student-Specific TSE for IS, 65% in 
TSE for BM, 46% in TSE for SE, and 56% in TSE for ES, 
respectively. At the between-teachers level, 14%, 9%, 25%, and 
16% of the variance in the respective Student-Specific TSE do- 
mains for IS, BM, SE, and ES was explained by the student- and 
teacher-level predictors. 
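
One common way to obtain such level-specific R² values in multilevel models is the proportional reduction in variance relative to the unconditional means model (cf. Raudenbush & Bryk, 2002). The sketch below illustrates this calculation with the Model 2 within-teacher variance for IS from Table 2 (.60); the baseline variance of 1.00 is an assumption chosen for illustration and is not a value reported in the text, and the function name is hypothetical.

```python
def variance_explained(baseline_var, model_var):
    """Proportional reduction in residual variance relative to the baseline model."""
    return 1 - model_var / baseline_var


# .60 is the Model 2 within-teacher variance for IS in Table 2;
# the baseline variance of 1.00 is an ASSUMED value for illustration only.
r2_within_is = variance_explained(baseline_var=1.00, model_var=0.60)
print(round(r2_within_is, 2))
```

Under this assumption the result matches the 40% of within-teacher variance reported for IS, but the calculation the authors actually used may differ.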


Cross-Level Interactions 


To evaluate whether Teacher Experience and perceived Class- 
room Misbehavior interacted in the prediction of Student-Specific 
TSE, the slopes of the student predictors were first allowed to vary 
across teachers. The random slope coefficients of the association 
between students’ Externalizing Behavior and Student-Specific 
TSE for BM (σ² = .08, p < .01), and between Prosocial Behavior
and Student-Specific TSE for BM (σ² = .09, p < .01) and ES
(σ² = .02, p < .05) were significantly different from zero, indicating
that these parameters varied across teachers. Consequently,
cross-level interactions between the teacher variables (i.e., Teacher 
Experience and perceived Classroom Misbehavior) and these stu- 
dent predictors were added stepwise to the model. Adding these 
cross-level interactions did not affect the significance of the pa- 
rameter estimates of Model 2. None of these cross-level interac- 
tions reached the significance threshold, except for the negative 
effect of teachers’ perceptions of Classroom Behavior Problems on 
the association between students’ Externalizing Behavior and 
Student-Specific TSE for BM (B = −.19, p < .01). This finding
indicates that teachers feel less efficacious in managing individual
students’ externalizing behavior when they perceive high amounts 
of misbehavior in the classroom. 
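
A cross-level interaction of this kind can be read as a conditional (simple) slope: the within-teacher slope of Externalizing Behavior changes by the interaction coefficient for every scale point of the teacher-level moderator. The sketch below uses the reported interaction coefficient (−.19); the main slope of −.73 is borrowed from Model 1 purely for illustration, since the final model's slope is not stated in the text, and the function name and moderator values are hypothetical.

```python
def conditional_slope(main_slope, interaction, moderator_value):
    """Slope of the student-level predictor at a given (centered) moderator value."""
    return main_slope + interaction * moderator_value


MAIN_SLOPE = -0.73   # ASSUMED: Externalizing -> TSE for BM slope taken from Model 1
INTERACTION = -0.19  # reported cross-level interaction coefficient

# Grand-mean-centered Classroom Misbehavior: one point below, at, and above the mean.
for misbehavior in (-1.0, 0.0, 1.0):
    slope = conditional_slope(MAIN_SLOPE, INTERACTION, misbehavior)
    print(misbehavior, round(slope, 2))
```

Because the interaction is negative, the negative association between externalizing behavior and self-efficacy for behavior management becomes steeper as perceived classroom misbehavior increases.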


Discussion 


This study investigated the associations between a variety of 
social-emotional student behaviors and teachers’ self-efficacy be- 
liefs toward individual students in various teaching domains. In 
addition, the moderating role of teachers’ professional experience 
and perceived classroom misbehavior was examined. Results from 
this study offer new insights into the ways in which students’ 
externalizing, internalizing, and prosocial behaviors may hamper 
or support teachers’ self-efficacy beliefs across teaching domains 
at a dyadic level. 


Teachers’ Self-Efficacy in Relation to 
Externalizing Behavior 


Consistent with expectations, teachers perceived themselves as 
less self-efficacious in relation to students who exhibited external- 
izing behavior in class, after controlling for students’ and teachers’ 
background characteristics. This is in support of previous research 
on teachers’ classroom-level self-efficacy (e.g., Lambert et al., 
2009; Tsouloupas et al., 2010), indicating that disruptive children 
may hamper teachers’ self-efficacy in dealing with challenging 
behavior and stressful situations in the classroom. However, 
whereas past studies have almost solely concentrated on total 
efficacy scores or domain-specific TSE for behavior management, 
our results additionally show that these undercontrolled behaviors 
are consistently linked to various domains of self-efficacy for 
teaching and learning. Accordingly, unsuccessful encounters with 
students who display externalizing conduct are likely to undermine
teachers’ perceived capability to effectively instruct, motivate, 
manage, and emotionally support individual students. Such poorer 
self-efficacy beliefs, in turn, may also bring about more disruptive 
student behavior in new situations (e.g., Bandura, 1997). 

It is not surprising that the association between externalizing 
student behavior and teachers’ perceived capability in deploying 
effective methods to prevent and redirect instances of student 
misbehavior appeared to be the largest. Possibly, these patterns of 
externalizing misconduct reflect a poorer fit with teachers’ expec- 
tations for appropriate behavior in the classroom than other chal- 
lenging student behaviors (Gresham & Kern, 2004). Such behav- 
ioral mismatches may trigger a pattern of disturbed student-teacher 
interactions, which potentially undermine teachers’ feelings of 
efficacy and satisfaction in teaching (cf. Koomen & Spilt, 2011). 
This is alarming, given that an unhealthy sense of self-efficacy for 
behavior management may encourage teachers’ use of ineffective 
conflict management styles, which may exacerbate students’ dis- 
ruptive behavior and potentially advance the erosion of teachers’ 
already feeble capability beliefs (e.g., Goddard et al., 2004; Jen- 
nings & Greenberg, 2009; Morris-Rothschild & Brassard, 2006). 

Perhaps more interesting is the finding that symptoms
of externalizing student behavior may also come at the expense of 
teachers’ student-specific self-efficacy beliefs in the instructional 
domain. There are some studies to support this finding, indicating 
that teachers generally feel less confident and effective in proac- 
tively involving disruptive students in high-quality instructional 
interactions and activities, and consequently resort to controlling 
and punitive behaviors toward these students (e.g., Arbeau & 
Coplan, 2007; Sutherland & Oswald, 2005; Wehby, Symons, 
Canale, & Go, 1998). Probably, such a lack of efficacy in instruct- 
ing and motivating challenging students may further reinforce 
these children’s expressions of anger and frustration toward the 
teacher as well as increase their off-task behavior and maladjust- 
ment in class (Arnold, 1997; Stipek & Miles, 2008). Thereby, a 
vicious cycle may be set into motion in which teachers’ student- 
specific self-efficacy percepts and instructional actions, and stu- 
dents’ subsequent social-emotional and task behaviors in class 
may influence each other in a reciprocal manner (cf. Bandura, 
1997; Stipek & Miles, 2008). Hence, given that externalizing 
student behaviors may hamper student-specific TSE in both in- 
structional and social-emotional domains, it seems essential to 
provide educators with the knowledge and skills necessary for 
teaching disruptive students self-regulation strategies that improve 
their classroom adjustment (cf. Koomen & Spilt, 2011). 


Teachers’ Self-Efficacy in Relation to 
Internalizing Behavior 


Consistent with expectations, internalizing behaviors seemed to 
be less of a factor than externalizing student behavior in explaining
variations in teachers’ self-percepts of student-specific self- 
efficacy. This finding resonates well with those of past research 
(e.g., Coplan & Prakash, 2003; Gresham & Kern, 2004; Kokkinos 
et al., 2004), suggesting that students’ internalizing symptoms 
might go undetected by their teachers, or are merely perceived as 
less serious. Accordingly, it is possible that teachers may display 
a greater zeal and persistence in educating internalizing children 
than externalizing children. 
As yet, our results give reason to believe that behaviors in the
internalizing spectrum may contribute to some aspects of teachers’ 
sense of student-specific self-efficacy. Specifically, teachers’ 
student-specific self-efficacy for emotional support seemed to be 
predicted best by students’ internalizing behaviors, after account- 
ing for students’ and teachers’ background features. One possibil- 
ity that may explain this negative association is that internalizers 
feel more wary and anxious in the face of social stimuli and 
consequently tend to refrain from daily interactions with their 
teacher (e.g., Arbeau, Coplan, & Weeks, 2010; Coplan & Prakash, 
2003; Rudasill, 2011). Such socially withdrawn behaviors may 
result in a student-teacher relationship pattern characterized by 
lower levels of closeness, and higher levels of dependency (e.g., 
Arbeau et al., 2010; Henricsson & Rydell, 2004; Roorda et al., 
2014). When teachers recurrently fail to connect and get through to 
these internalizing children, poorer self-efficacy beliefs toward 
these particular children may be prompted (e.g., Bandura, 1997). 
This may explain why teachers usually fall back into regulatory 
and dominant behaviors toward students with internalizing behav- 
ior (Roorda, Koomen, Spilt, Thijs, & Oort, 2013). 

It is somewhat surprising that teachers also reported slightly 
elevated levels of self-efficacy in the domain of behavior manage- 
ment toward students with internalizing behavior. One mainly 
methodological explanation for this finding may be that internal- 
izing student behavior merely functioned as a suppressor for 
predicting the fairly stronger, unique association among students’ 
externalizing behavior and TSE for behavior management. Ac- 
cording to Maassen and Bakker (2001), this phenomenon may 
occur when a predictor is positively correlated with another inde- 
pendent variable, but not with the criterion. In the present study, 
suppression may indicate that internalizing student behavior has 
more in common with externalizing conduct than with teachers’ 
student-specific self-efficacy for behavior management, and 
thereby improved externalizing behavior as a predictor of TSE for 
behavior management. This potential suppressor effect mirrors 
previous empirical research, suggesting that comorbid externaliz- 
ing and internalizing symptoms may occur more frequently than 
single-form behaviors, and should therefore be interpreted in com- 
bination with each other, rather than separately (e.g., Keiley et al., 
2003). Another, more theoretical justification for this effect is that 
students with anxious and withdrawn patterns of behavior (without 
potentially co-occurring externalizing symptoms) usually do not 
disturb their peers or challenge their teachers’ authority. Thereby, 
these students seem to meet teachers’ behavioral values and ex- 
pectations in the classroom (Gresham & Kern, 2004). As such, it 
is possible that teachers might actually feel quite self-efficacious in 
managing these students’ behaviors. 

Last, internalizing student behavior did not seem to seriously 
upset their teachers’ self-efficacy for tasks related to motivation 
and instructional delivery. The lack of association between stu- 
dents’ internalization and student-specific TSE for student engage- 
ment was, for instance, at odds with our expectation that teachers 
may feel less efficacious in activating their students’ interest in 
schoolwork when dealing with emotionally disturbed students. 
Moreover, the negative association between students’ internalizing 
behavior and TSE for instructional strategies only reached the 
significance threshold after accounting for teachers’ gender, expe- 
rience, and perceived classroom misbehavior. It may be that edu- 
cators’ recognition of, and responsiveness to internalizers’ subtle 
cues are more likely to be affected by factors inherent or contextual
to the teacher than their preoccupation with externalizers’
more blatant signs. Research of Kokkinos and colleagues (Kokki- 
nos et al., 2005; Kokkinos & Kargiotidis, 2014), for instance, put 
forth that teachers’ ability to recognize the needs and behaviors of 
students with internalizing problems increases as they have more 
teaching experience, and may depend on their own interpersonal 
sensitivity and gender. Correlational patterns between students’ 
social-emotional behaviors and teacher-level variables in the pres- 
ent study, including teaching experience and gender, largely sub- 
stantiate this assumption. Also, there is a strong possibility that 
students with internalizing symptoms, due to their subdued behav- 
iors, generally provoke less negative thoughts about instruction or 
feelings of inefficacy in their teachers, as it is more difficult for 
teachers to gauge these students’ comprehension of what they have 
taught (e.g., Rubin & Coplan, 2004). 


Teachers’ Self-Efficacy in Relation to 
Prosocial Behavior 


In line with expectations, teachers consistently reported higher 
levels of self-efficacy in relation to students who exhibit high 
levels of prosocial behavior. Again, stronger associations were 
noted for teachers’ self-efficacy toward emotional and behavioral 
domains of teaching and learning, than for instruction-related 
tasks. This is perhaps not surprising, as the domains of behavior 
management and emotional support are, in large part, concerned 
with how well teachers relate to, and interact with their students. 
Several empirical sources have shown that patterns of prosocial 
student behavior may pave the way for higher quality relationships 
with their teachers (Birch & Ladd, 1998; Henricsson & Rydell, 
2004; Roorda et al., 2014). Such enactive mastery experiences may 
raise teachers’ beliefs in their self-efficacy (Bandura, 1997; God- 
dard et al., 2004), potentially further stimulating individual stu- 
dents’ prosocial behaviors in the classroom. 

Despite teachers’ higher self-efficacy beliefs in relation to stu- 
dents who display relatively high levels of prosocial behavior, 
teachers have repeatedly been shown to spend less time with 
prosocial students, and regularly fail to give them credit for their 
positive behavior, especially when they get older (e.g., Arbeau & 
Coplan, 2007; Nesdale & Pickering, 2006). To maintain and fur- 
ther encourage prosocial behavior in their students, teachers should 
recognize the need to praise and respond to students’ appropriate 
behaviors in class. In doing so, teachers may further enhance their 
feelings of self-efficacy toward these individual children. 


Teachers’ Self-Efficacy in Relation to Student and 
Teacher Characteristics 


In investigating students’ background characteristics, we only 
found students’ age to be negatively associated with teachers’ 
student-specific self-efficacy for student engagement. This finding 
is supported by prior research (Wolters & Daugherty, 2007), 
noting that teachers, when dealing with older children, tend to 
report less confidence in their ability to keep students engaged. 
This intriguing finding seems to complement those of studies on 
student motivation (e.g., Fredricks & Eccles, 2002), which dem- 
onstrated a downward spiral in students’ competence-related be- 
haviors and motivation during their transition to middle school. 
Future research should take the complex interplay between teachers’ 
self-efficacy, students’ age, and motivation into account. 

Although bivariate correlations suggested a potential association 
between professional teaching experience and dimensions of TSE 
toward individual students, multilevel analyses indicated that 
teaching experience only added to the prediction of student- 
specific TSE for student engagement and emotional support. This 
finding suggests that educators’ teaching experience particularly 
enhances their self-efficacy in the affective domain of teaching, 
including such tasks as providing emotional support and increasing 
individual students’ interest in schoolwork. Previous studies have 
supported this slight increase in more experienced teachers’ self- 
efficacy, both for affective domains as well as other areas of 
teaching and learning (Klassen & Chiu, 2010; Ross et al., 1996; 
Wolters & Daugherty, 2007). The potential value of teachers’ 
experience for their self-efficacy might explain, in part, why 
experienced teachers seem to be more effective in managing 
students’ behaviors and addressing their needs than inexperienced 
teachers (Kokkinos et al., 2004). 


The Moderating Role of Teaching Experience and 
Perceived Classroom Misbehavior 


In seeking to discern the moderating role of teachers’ experience 
and perceived classroom misbehavior, we noted that years of 
experience did not buffer or exacerbate the association between 
students’ social-emotional behaviors and teachers’ self-efficacy 
toward individual students. This is unlike the findings of Kokkinos 
et al. (2004), which seemed to suggest that teachers’ experience- 
induced behavioral knowledge, skills and awareness may buffer or 
exacerbate the potential negative relationship between challenging 
student behavior and TSE. However, results did point to a mod- 
eration effect of teachers’ perceptions of classroom misbehavior. 
Specifically, teachers in poorly behaving classrooms experienced 
lower levels of self-efficacy in managing the behavior of individ- 
ual students with externalizing conduct than in classrooms with 
fewer instances of misbehavior. This finding substantiates prior 
research, indicating that teachers may develop increasingly negative 
attitudes toward their students in classrooms with many 
challenging students. It is important to note, however, that the moderating 
role of teacher-perceived amounts of classroom misbehavior 
could not be ascribed to teachers’ appraisals of individual students’ 
externalizing behavior. In the present study, the zero-order corre- 
lation between misbehavior in class and ratings of externalizing 
student behavior was not significant. Hence, these findings under- 
line the relevance of considering characteristics of the classroom 
when investigating teachers’ beliefs of self-efficacy. 


Limitations 


The present study’s findings need to be interpreted in the con- 
text of several limitations. First, the correlational and cross- 
sectional nature of the study precludes any speculation on causal 
relations. Although our results provide preliminary support of the 
potential relationships between students’ behavior and TSE, it may 
well be that the nature of these associations is reciprocal. Indeed, 
Bandura’s (1997) model of triadic reciprocal causation asserts that 
teachers’ personal factors, their behaviors, and aspects of the 
classroom context may function as interacting factors that influence 
one another bidirectionally. Longitudinal, cross-lagged de- 
signs could advance our understanding of how individual students’ 
behaviors and teachers’ self-efficacy toward these students in 
various domains of teaching and learning influence one another 
across time. 

In relation to this issue, some caution is warranted when gen- 
eralizing the results of this study to other populations and settings. 
Specifically, this study relied on a sample of primarily experi- 
enced, female teachers who generally taught students with mid- to 
high socioeconomic backgrounds. These teachers, by virtue of 
their experience and more advantaged student population, may 
have felt more efficacious and better prepared to deal with their 
students across teaching domains. Including teachers from a wider 
range of backgrounds may result in a more reliable and general- 
izable picture of teachers’ self-efficacy in relation to particular 
students in different spheres of functioning. 

Second, teachers not only reported about their sense of self- 
efficacy toward individual students, but also about these students’ 
behaviors. As such, this study might have been threatened by 
shared source variance, resulting in an overestimation of the 
strength of associations. However, teachers’ self-efficacy is most 
likely constructed from information conveyed by experienced 
events in the classroom (Bandura, 1997). Given that teachers’ own 
experiences and self-knowledge are crucial sources of their self- 
efficacy, teacher reports may seem an adequate method of mea- 
suring students’ classroom behaviors. Still, it would be useful for 
future research to employ multiple methods, including interviews 
and observations, to further elucidate the present study’s findings. 

Third, although we made use of multilevel analysis to handle the 
clustering of students within teachers, we did not address the 
nesting of classrooms within schools. One reason for choosing to 
ignore a third level of nesting is that we generally found less than 
5% of the variance in TSE to be associated with the school-level 
of hierarchy, suggesting that teachers’ capability beliefs did not 
vary much across schools. This lack of variation might be 
explained by the fact that the 69 teachers who participated in this 
study were relatively evenly distributed across the 24 schools. 
Indeed, only two to three teachers per school decided to take part. 
Nevertheless, a number of studies on the sources of TSE have 
indicated that teachers’ self-efficacy may depend, in part, on 
aspects such as school atmosphere, principal leadership, and social 
support provided by parents and colleagues (e.g., Cheung, 2008; 
Lee, Dedrick, & Smith, 1991; Moore & Esselman, 1992; 
Tschannen-Moran & Woolfolk Hoy, 2007). With this in mind, it 
may be important to include such school contextual influences at 
the school-level of analysis when investigating teachers’ self- 
efficacy beliefs. 
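
The size of such a school-level share can be expressed as an intraclass correlation. The following is only a sketch under assumptions: it supposes a three-level variance decomposition (students within teachers within schools), and the symbols are illustrative rather than the notation of the present study.

```latex
\rho_{\text{school}} = \frac{\sigma^2_{\text{school}}}{\sigma^2_{\text{school}} + \sigma^2_{\text{teacher}} + \sigma^2_{\text{student}}}
```

Under this reading, a value below .05 would correspond to the less than 5% of variance in TSE located at the school level.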

Fourth, it is possible that the relations discovered in this study 
emanate from a common relation with contextual or structural 
features of the classroom context. Although we were able to 
account for differences between teachers in their gender, years of 
experience, and perceived classroom misbehavior, there might 
have been other important between-teachers factors that we did not 
include in this study. For instance, teachers’ collegial support 
(Brownell & Pajares, 1999; Ciani, Summers, & Easter, 2008), 
instructional quality and classroom management (Holzberger, 
Philipp, & Kunter, 2013), and perceived work pressure (Leroy, 
Bressoux, Sarrazin, & Trouilloud, 2007) have been shown to be 
associated with teachers’ sense of self-efficacy. Thus, in any 
attempt to replicate the results, it is recommended that future 
researchers should take account of classroom and teacher charac- 
teristics to explain between-teachers differences in TSE. 

Last, teachers’ perceptions of self-efficacy were characterized 
by relatively high means and small standard deviations, suggesting 
the existence of social desirability bias. Generally, social desir- 
ability has been presumed to generate more flattering reports about 
the self and a limited range of answers (Goffin & Gellatly, 2001). 
This potential bias in teachers’ responses has also been noted in 
prior research on teachers’ domain-specific self-efficacy at the 
classroom-level (e.g., Heneman et al., 2006), and might have 
weakened the associations with students’ behaviors in this study. 


Conclusion 


Despite its limitations, the present study has demonstrated the 
theoretical and practical relevance of studying TSE in relation to 
individual students’ social-emotional behaviors across various 
domains of teachers’ functioning. Teachers’ self-efficacy has long 
been conceptualized as a relatively stable teacher characteristic 
which, at best, may be dependent upon particular teaching tasks 
and domains (Raudenbush et al., 1992; Tschannen-Moran & 
Woolfolk Hoy, 2001). Our results show, however, that most of the 
variance in TSE occurred within teachers, suggesting that these 
capability beliefs may also vary over the particular students they 
teach. Central contributors to such self-efficacy fluctuations seem 
to be both prosocial and challenging student behaviors, and exter- 
nalizing behavior in particular. Notably, these behaviors not only 
appear to relate to teachers’ perceived effectiveness in providing 
behavioral and affective support during reciprocal student-teacher 
interchanges, but their TSE in delivering instruction as well. This 
is an important finding, given that teachers’ dealings with individ- 
ual students’ misbehavior are likely to come at the expense of 
high-quality instructional activities and student-teacher interac- 
tions (e.g., Arbeau & Coplan, 2007; Sutherland & Oswald, 2005). 

The results of the present study, if they are replicated in future 
studies, may have several implications for educational researchers 
and practitioners alike. First, the ways teachers appraise and inte- 
grate individual students’ behavior into student-specific self- 
efficacy judgments may play an important role in teachers’ pre- 
paredness and motivation to deal with a particular child (Bandura, 
1997). Presumably, educators who perceive themselves as unable 
to teach and affectively support a child tend to shy 
away from that child or slacken their efforts when the going 
gets tough. Teachers must be made aware that such behaviors and 
actions may have serious implications for the academic and social-emotional 
adjustment of challenging students, and externalizing 
children in particular. Specifically, children with externalizing 
problems may become easily frustrated or unhappy about their 
teachers’ lack of instructional or emotional support, and may 
express these feelings by acting more aggressively toward the 
teacher in future situations (cf. Stipek & Miles, 2008). As such, 
teachers’ self-efficacy beliefs toward disruptive students and as- 
sociated behavior and actions may serve as an additional risk 
factor for poor quality student-teacher relationships and students’ 
social-emotional and academic maladjustment in school. Yet, the 
importance of teachers’ confidence in their ability to provide 
internalizing students with adequate emotional support should also 
not be underestimated. These capability beliefs may serve as 
important tools for helping students with internalizing symptoms 
to come out of their shell and to navigate the social world. Thus, 
helping teachers to reflect on the effects of their cognitions about 
externalizing and internalizing children may be vital to improving 
the quality of students’ and teachers’ shared interactions and 
experiences. 

Second, the dynamic interplay between students’ disruptive 
behaviors and TSE may not only hamper students’ academic 
adjustment, but may also result in increased levels of emotional 
labor, daily stress, and burnout in teachers (e.g., Chang, 2013; 
Hargreaves, 1998; Spilt et al., 2011). This suggests that teacher 
training and development programs must incorporate strategies 
that teachers might use to bolster their self-efficacy in relation to 
individual (disruptive) students, including goal setting, behavior 
management, and providing emotional support. These activities 
may allow teachers to gain more pleasant emotional experiences 
with, and social feedback from, their students, resulting in less 
stress and higher TSE (Spilt et al., 2011). 

In conclusion, it behooves educational researchers and practitioners 
alike to further investigate the complex ways in which teachers’ 
self-efficacy in relation to individual students with externalizing and 
internalizing symptoms and their subsequent behaviors and actions 
toward them affect students’ motivation, conduct, and achievement in 
the classroom. Viewing teachers’ self-efficacy from a dyadic perspec- 
tive may be a first step forward. 
Changes in the demands posed by increasingly complex workplaces in the 21st century have raised the 
importance of nonroutine skills such as complex problem solving (CPS). However, little is known about 
the antecedents and outcomes of CPS, especially with regard to malleable external factors such as 
classroom climate. To investigate the relations between CPS and other constructs, we had Finnish 
6th-grade students complete a test battery that included CPS tasks, fluid reasoning, classroom climate, 
and academic outcomes such as school grades and academic potential (N = 1,670). The working memory 
test was administered to a subsample of students (N = 357). A latent multilevel analysis suggests that 
(a) fluid reasoning, working memory, and classroom climate influenced CPS skills, and (b) CPS skills 
exhibited some incremental value in explaining school grades after controlling for cognitive ability, 
although the largest part of CPS’ relations to the outcomes was due to its overlap with other cognitive 
abilities. Further, on the class level, classroom climate showed a significant indirect effect on school 
grades via its influence on between-class differences in CPS. On the basis of this pattern of results, we 
argue that classroom climate is likely to be an important antecedent of CPS skills. Hence, we suggest that 
future research further explore how CPS is related to malleable factors such as classroom climate and 
extend analyses on the predictive validity of CPS to include real-world outcomes beyond the academic 
setting. 


Keywords: complex problem solving, fluid reasoning, working memory, classroom climate, school 
grades 


“How can we best prepare students for later job success?” This 
question represents one of the greatest challenges currently faced 
in education (Mayer, 2003; OECD, 2012). It is especially chal- 
lenging because the demands encountered on the job seem to be 
changing substantially. Data gathered over several decades by the 
U.S. Department of Labor suggest that increases in the computer- 
ization and automation of tasks are associated with a reduction in 
the amount of routine human labor and an increase in the number 
of nonroutine problem solving tasks (Autor, Levy, & Murnane, 
2003). Whereas in the 1960s, many tasks in the average workplace 
demanded routine responding to similar, recurring situations, today, 
50 years later, routine tasks have largely left human workplaces 
and are now often performed by computers or machines. 
Instead, an increase in the frequency of nonroutine tasks involving 
complex problem solving skills has been observed (Greiff, 
Wüstenberg et al., 2014). 

A number of recent efforts in educational contexts have been 
initiated to define and assess complex problem solving skills and 
to develop educational programs and training interventions that 
will increase these skills (e.g., Greiff, Wüstenberg et al., 2014). 
Arguably, the most comprehensive effort in contemporary educa- 
tional surveys to assess emerging nonroutine skills has been un- 
dertaken by the Programme for International Student Assessment 
(PISA). In the 2012 PISA survey (OECD, 2014), for the first time 
ever, Complex Problem Solving (CPS) was included as a nonrou- 
tine skill (labeled creative problem solving), thus complementing 
the traditional PISA domains of mathematics, science, and reading. 
PISA is implemented in 3-year cycles and assesses the skill levels 
of hundreds of thousands of students in over 70 countries. With 
particular regard to CPS, the PISA 2012 framework (OECD, 2013) 
emphasized the idea that nonroutine skills such as CPS are necessary 
for future learning across the life span (see Baker & O’Neil, 
2002) and are key for successfully meeting the shift in demands in 
the workplace described above. Thus, a substantial amount of 
assessment time in numerous countries was dedicated to CPS in 
PISA 2012. For instance, results from PISA 2012 revealed that 
approximately 8% of students in OECD countries or economies 
were allocated to the lowest of seven problem solving proficiency 
levels as they were able to solve only straightforward problems 
using trial-and-error. The majority of students reached Level 3 
(approximately 57%) and were able to plan ahead, monitor their 
progress, and try different options. Only 2.5% of students reached 
the highest level and showed, for instance, the ability to modify 
their strategies, take all constraints into account, and apply flexi- 
ble, multistep plans (OECD, 2014). Further, results from PISA 
2012 revealed that CPS was strongly correlated with mathematics 
(r = .81), science (r = .78), and reading (r = .75; OECD, 2014). 
However, the OECD (2014) noted that “these correlations may 
appear large, but they are smaller than the correlation observed 
among mathematics, reading and science” (p. 68) and that this 
result “clearly proves that problem solving constitutes a separate 
domain from mathematics, reading and science” (p. 69). 
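
To make the size of these correlations more concrete, they can be converted into proportions of shared variance by squaring them. The snippet below is a minimal sketch of that arithmetic, not part of the OECD analysis; the function name `shared_variance` is hypothetical, and the input values are the correlations quoted above.

```python
# Correlations between CPS and the three traditional PISA domains,
# as quoted in the text (OECD, 2014).
correlations = {"mathematics": 0.81, "science": 0.78, "reading": 0.75}


def shared_variance(r: float) -> float:
    """Return the proportion of variance two measures share (r squared).

    Hypothetical helper for illustration only.
    """
    return r ** 2


for domain, r in correlations.items():
    print(f"{domain}: r = {r:.2f}, shared variance = {shared_variance(r):.2f}")
```

On this reading, CPS shares roughly 56% to 66% of its variance with each of the three domains, which leaves a substantial share unexplained by any single one of them.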

Despite widespread agreement about the relevance of nonrou- 
tine skills in general and CPS in particular, the definition and 
measurement of these skills still present significant challenges. In 
particular, these skills are rarely assessed in a psychometrically 
rigorous way, for instance, due to dependent indicators, low reliability, 
or the unwanted influence of prior knowledge (see Kröner, 
Plass, & Leutner, 2005; Wüstenberg, Greiff, & Funke, 2012). The 
current study was aimed at advancing the understanding of the 
assessment, the antecedents, and the outcomes of CPS by focusing 
on two research questions. First, we examined the relations be- 
tween CPS and other relevant constructs. Specifically, for the first 
time in CPS research, we examined the relations between CPS, 
two theoretically relevant cognitive abilities (i.e., fluid reasoning 
and working memory) from the Cattell-Horn-Carroll theory on 
human intelligence (see Keith & Reynolds, 2010), and classroom 
climate as a malleable external factor. Second, we asked whether 
CPS could measurably add to the explanation of important out- 
comes. Specifically, we evaluated whether CPS could explain 
variance in external variables such as academic achievement mea- 
sured by school grades and teacher-rated academic potential above 
and beyond other cognitive measures and classroom climate, thus 
providing a conservative test of the added value of CPS skills. To 
do so, we used latent structural equation modeling of data obtained 
from 1,670 Finnish sixth-grade students. 


Complex Problem Solving: Definition and Recent 
Empirical Findings 


A problem is defined as a situation in which a given state differs 
from a goal state, and no routine method for reaching the goal state 
is available (Mayer & Wittrock, 2006; Newell & Simon, 1972). 
The subsequent process of problem solving is aimed at transform- 
ing the original state into the desired goal state (Funke, 2010). This 
formulation of problem solving was previously mentioned by the 
famous Gestaltist Duncker (1945), whose candle problem constitutes 
one of the most prominent intransparent problems used in 
studies on insight (Öllinger, Jones, & Knoblich, 2008).¹ In such 
tasks, the problem cannot be solved with a step-by-step method, 
and the problem solver suddenly realizes the solution to the problem 
(e.g., Öllinger et al., 2008). Whereas early problem solving 
research had focused on noninteractive tasks such as the candle 
problem, the specific field of nonroutine CPS skills is centered on 
the ways people interact with and strategically explore a new 
problem situation step by step (Autor et al., 2003). Buchner 
(1995) defined CPS as: 


The successful interaction with task environments that are dynamic 
(i.e., change as a function of user’s intervention and/or as a function 
of time) and in which some, if not all, of the environment’s regular- 
ities can only be revealed by successful exploration and integration of 
the information gained in that process (p. 14). 


According to Funke (2001) and the OECD (2013), mastering a 
complex problem involves: (a) extracting relevant, but at the 
outset, hidden information; (b) thus establishing a representation of 
the problem, which is continuously updated by integrating feed- 
back; and (c) applying procedural abilities to understand and 
control a dynamically changing situation. Novick and Bassok 
(2005) delineated two CPS subskills: (a) deriving a viable repre- 
sentation of the problem at hand, as only incomplete information 
of the underlying problem structure is available in the beginning 
(i.e., knowledge acquisition); and (b) solving the problem, as a 
specific pattern of interventions to reach the desired goal states 
must be derived and carried out (i.e., knowledge application). 
These subskills are conceptually as well as empirically strongly 
correlated (e.g., Wüstenberg et al., 2012) and are considered 
largely independent of rote learning and specific factual knowl- 
edge (Mayer & Wittrock, 2006). 

In line with these two subskills (i.e., knowledge acquisition and 
knowledge application), learners have to first acquire knowledge 
about a new problem situation and subsequently apply this novel 
information when solving a complex problem. That is, flexibly 
adapting to a new problem situation and learning, integrating, and 
using new information are central aspects of CPS (Funke, 2010; 
Greiff et al., 2013). Recent empirical findings have suggested that 
CPS skills are relevant for explaining academic achievement. 
Wüstenberg et al. (2012) and Schweizer, Wüstenberg, and Greiff 
(2013) reported moderate relations between CPS skills and cogni- 
tive ability and the incremental value of CPS for explaining 
variance in school grades. Sonnleitner, Brunner, Keller, and Mar- 
tin (2014) suggested that CPS measures allow for a less biased 
assessment of student populations who suffer from a low level of 
language mastery (i.e., students with an immigration background) 
compared with classical tests of scholastic achievement. Thus, not 
only are nonroutine CPS skills considered to be conceptually 
relevant to education in the 21st century (cf. Autor et al., 2003; 
OECD, 2013), but there have also been a number of empirical indications 
that CPS skills yield added value, 
even though accounts of this added value have relied on single 
studies, thus limiting their scope to specific aspects of the validity 
of CPS (e.g., its divergent validity from single cognitive abilities 
such as fluid reasoning). Indeed, these two aspects (i.e., conceptual 
relevance and recent empirical findings) were the driving factors 
behind the inclusion of CPS skills in the international PISA 2012 
assessment. 


¹ In the candle problem, a candle has to be attached to a wall with the 
help of a matchbox and a box of thumbtacks in such a way that the candle 
wax will not spill on the table below. The solution is to put the candle in 
the box and to use the thumbtacks to attach the box to the wall. 
So far, we have used the term CPS skills to describe an individual’s 
level of performance in solving new problems without any 
further reflection on the term skill. Anderson (2009) considers a 
skill anything that can be learned and, with regard to CPS, this 
terminology implies that CPS skills can be actively developed. 
Fischer and Bidell (1998) noted that “skills do not spring up . . .; 
they are built up gradually through the practice of real activities in 
real contexts and are then gradually extended to new contexts” (p. 
478). Skills vary immensely in their degree of complexity (e.g., 
basic motor skills vs. complex cognitive skills; cf. Anderson, 
2009), but their common core is that they can be learned and, 
hence, actively fostered. 

The question that arises from these considerations is whether 
CPS falls in line with other skills, thus implying that it can be 
fostered and developed through education and training systems. 
Mayer and Wittrock (2006) implicitly suggested that education has 
an effect on a student’s level of problem solving by stating that 
“one of educational psychology’s greatest challenges [is to make] 
students . . . better problem solvers” (p. 299). Also in the PISA 
studies, problem solving has been presented as a skill, and the 
trainability and learnability of CPS skills were repeatedly empha- 
sized in the PISA 2012 theoretical framework for CPS (OECD, 
2013). Despite the general notion that CPS can be molded through 
education and training, and thus can be identified as a skill, 
existing empirical studies have been silent on this matter (see 
Greiff, Wüstenberg et al., 2014). 

In this study, we augmented existing research by investigating a 
pattern of relations between several constructs (antecedents and 
outcomes) and CPS skills, whereas previous studies have focused 
on only some of these constructs and usually only one at a time. As 
will be outlined later, we empirically considered for the first time 
whether CPS can be influenced by classroom climate as an exter- 
nal factor beyond measures of cognitive abilities (cf. the subsec- 
tion on antecedents of CPS skills) in addition to analyzing the 
utility of CPS for explaining school grades and measures of 
academic potential above and beyond ability measures. In doing 
so, we used CPS tasks developed in the same framework as the 
PISA 2012 tasks. 


Antecedents of CPS Skills 


The development of cognitive skills such as CPS is likely to 
depend on both the cognitive abilities that underlie the recognition, 
framing, and solution of complex problems (e.g., McGrew, 2009) 
and the conditions under which the learning and practicing of these 
skills take place (e.g., Adelman & Taylor, 2005). Whereas cogni- 
tive abilities represent individual differences that are relatively 
resistant to change (Barnett, 2004), other factors such as classroom 
climate are more malleable, for instance, through interventions that 
target specific teacher attitudes and behaviors in class (Adelman & 
Taylor, 2005; Emmer & Stough, 2001; Wubbels & Brekelmans, 
1998). To this end, in our investigation of the antecedents of CPS 
skills, we focused on both stable individual differences and aspects 
of the classroom climate that are more amenable to intervention. 


Cognitive Abilities 


Conceptually, the most comprehensive theory on cognitive abilities 
is the Cattell-Horn-Carroll (CHC) theory of human intelligence 
(McGrew, 2009), which integrates the Cattell-Horn theory 
of fluid and crystallized intelligence (e.g., Cattell, 1971; Horn & 
Noll, 1997) and Carroll’s three-stratum theory (Carroll, 1993). The 
CHC theory incorporates the lifetime efforts of some of the most 
renowned cognitive ability researchers and provides scholars “with 
the first empirically based consensus Rosetta stone from which to 
organize research and practice” (McGrew, 2005, p. 171). It pro- 
poses one very general factor of cognitive ability (g on stratum III), 
which influences several broad cognitive abilities including crys- 
tallized intelligence, fluid reasoning, working memory, visual pro- 
cessing, auditory processing, long-term storage and retrieval, cog- 
nitive processing speed, decision and reaction speed, reading and 
writing, and quantitative knowledge, all of which are found on 
stratum II. Narrow cognitive abilities are located on stratum I, and 
specific tests of cognitive ability usually test for these stratum I 
abilities (e.g., inductive reasoning on stratum I is seen as a valid 
indicator of the stratum II ability of fluid reasoning; McGrew, 
2009). The CHC theory is exceptionally relevant for understanding 
human cognitive performance and has received a considerable 
amount of attention in the educational arena in explorations of the 
potential to teach and train relevant skills. In fact, cognitive abil- 
ities as described in the CHC theory are considered important 
prerequisites for almost any cognitive skill. In their general over- 
view, Reeve and Hakel (2002) concluded that both theoretically 
and empirically, cognitive abilities as described in the CHC theory 
are relevant in almost any field of human performance and learn- 
ing, including the acquisition of skills. 

However, Raven (2000) and Funke (2010) have pointed out that 
cognitive abilities may not fully explain the acquisition of skills 
such as CPS. To this end, Leighton (2004) stated that complex 
problems are composed of specific characteristics that are not 
found in cognitive abilities such as fluid reasoning or working 
memory. Specifically, Raven (2000) pointed out that problem 
solving skills involve experimental interactions with the environ- 
ment in order to identify the nature of a problem. Thus, cognitive 
abilities considered in the CHC theory should explain CPS skills to 
some extent, but CPS skills are also likely to be influenced by 
variables other than abilities. Raven (2000) and Wüstenberg et al. 
(2012) have elaborated in detail on the conceptual differences 
between the two concepts CPS and fluid reasoning. 

Fluid reasoning covers the core of human cognitive ability 
(Carroll, 1993) as it encompasses mental operations such as draw- 
ing inferences, identifying relations, understanding rules, or com- 
prehending implications; measures of fluid reasoning are strongly 
related to the overall cognitive ability factor g on stratum III. Many 
of the operations that are integral to fluid reasoning may be related 
to complex problem solving (Wüstenberg et al., 2012); however, 
fluid reasoning is of such a general nature that it can be considered 
relevant to virtually any cognitive skill (cf. McGrew, 2009). Working 
memory² includes the simultaneous storage and processing of 


² The terms working memory and short-term memory (STM) are sometimes 
used interchangeably in research on the CHC theory (e.g., Keith & 
Reynolds, 2010). We use the term working memory, as it is regarded as 
more complex in that it contains both a storage component and the function 
of maintaining memory representations (e.g., Daneman & Carpenter, 
1980). Also, McGrew (2009) stressed the importance of working memory 
for understanding new learning and the performance of complex cognitive 
tasks.
information in the face of concurrent processing or distraction
(e.g., Daneman & Carpenter, 1980). Wirth and Klieme (2003) 
highlighted the importance of working memory for CPS as “the 
underlying structure of the problem is complex, and the amount 
of relevant information exceeds the capacity of working mem- 
ory” (p. 332). Thus, depending on a person’s working memory, 
the amount of new information this person can process when 
solving a complex problem may vary and, in this way, impact 
CPS skills (Schweizer et al., 2013). 

To this end, several empirical studies have supported the theo- 
retical assumption of moderate relations between cognitive abili- 
ties and CPS skills. Fluid reasoning, which is considered to be at 
the heart of intelligence (Carroll, 1993) and the CHC theory 
(McGrew, 2009), has been found to account for up to 40% of the 
variance in CPS skills (e.g., Gonzalez, Vanyukov, & Martin, 2005; 
Sonnleitner, Keller, Martin, & Brunner, 2013; Wüstenberg et al.,
2012) and was a significant predictor of CPS in a longitudinal
analysis (Frischkorn, Greiff, & Wüstenberg, 2014). Working
memory was found to account for up to 20% of the variance in 
CPS (Schweizer et al., 2013). However, these studies have limited 
their analyses to one stratum II ability only (i.e., either fluid
reasoning or working memory). 

Bühner, Kröner, and Ziegler (2008) conducted the only study
that related CPS skills to more than one cognitive ability at a time. 
They used the two stratum II abilities working memory and visual 
processing to explain variance in CPS skills and found that visual 
processing did not exhibit any additional influence on CPS skills 
after controlling for working memory. However, the path coefficients
reported by Bühner et al. (2008) were surprisingly small
(only between 7% and 19% of the variance in CPS was explained 
by the two cognitive abilities combined). 

In the current study, for the first time ever, we related fluid 
reasoning and working memory to CPS skills simultaneously. The 
CHC theory has identified these as important cognitive abilities 
(McGrew, 2009) that may be the most relevant for explaining 
performance in cognitive skills such as CPS. These abilities thus 
provide a stricter test of the hypothesis that assessments of CPS
skills offer value over and above assessments of the most relevant 
cognitive underpinnings of complex problem solving. As fluid 
reasoning and working memory refer to theoretically different 
cognitive abilities (McGrew, 2009) and are empirically distinct 
(Ackerman, Beier, & Boyle, 2005), we expected fluid reasoning 
and working memory as general cognitive dispositions on stratum 
II to each explain unique aspects of the development and overall 
level of CPS skills. 


Classroom Climate 


Because of the relative stability of core cognitive abilities, the 
potential and the expected benefit of actively fostering cognitive 
abilities such as fluid reasoning are limited. In fact, studies that
have reviewed interventions and training programs that were de- 
signed to increase cognitive abilities have reported only small 
effects or no effects at all (e.g., Barnett, 2004; Herrnstein & 
Murray, 1994). At the same time, schooling, with its numerous 
formal and informal learning opportunities, has been found to 
enhance cognitive performance and to teach students an almost 
infinite number of noncognitive and cognitive skills (Rutter & 
Maughan, 2002). The hope behind the curricula of many schools is 


1031 


that the parts of CPS skills that cannot be explained by cognitive 
abilities are—directly or indirectly—influenced by external vari- 
ables that are under the control of the school or the educational 
system behind it.

For instance, aspects of the school environment such as 
classroom climate have long been considered crucial for skill 
development and acquisition (Fraser & Fisher, 1982; Haertel, 
Walberg, & Haertel, 1981). Conceptually, Fisher and Fraser 
(1981) identified four dimensions of classroom climate: cohe- 
sion (i.e., the degree to which students feel a sense of belong- 
ing), competition (i.e., the degree to which students compete 
with classmates), friction (i.e., the degree to which students do 
not get along and are unfriendly to one another), and task 
orientation (i.e., the degree to which students are orderly and 
complete their work on time; see also Goh & Fraser, 1998). 
Empirically, these dimensions are somewhat distinct but highly 
correlated (Goh & Fraser, 1998). Although Fisher and Fraser’s 
(1981) initial work was published three decades ago, these 
dimensions of classroom climate are still part of contemporary 
questionnaires that are used to capture students’ learning envi- 
ronment (see Adamski, Fraser, & Peiro, 2013). 

The climate of a classroom emerges from complex social inter- 
actions between students and between students and teachers, but 
the teacher plays a paramount role in this process and in shaping 
the classroom climate (Jennings & Greenberg, 2009; Wubbels & 
Brekelmans, 1998). Classroom climate affects skill development 
and knowledge acquisition, and the same is likely to be true for 
CPS skills. A generally positive classroom climate should encour- 
age students to unfold their potential and maximize their eagerness 
to approach new problem situations. That is, in order to develop 
their CPS skills in school or in training, students have to actively 
engage with new problem situations and explore these situations 
on a regular basis. To this end, classroom climate may influence 
whether or not students feel encouraged to tackle new problems 
and whether or not they feel invited to test novel solutions and 
ideas in order to resolve a given problem. For instance, it may be 
the case that a negative classroom climate undermines students’ 
efforts to tackle new problems, whereas a positive classroom 
environment results in higher levels of students’ perceived security 
and enables them to actively approach and explore new problem 
situations, even though the outcome of the approach may be 
uncertain. As CPS skills depend on active participation and direct 
experience with new problem situations, we argue that a positive 
classroom climate helps students to optimally develop their CPS
skills. 

In summary, our theoretical model includes the exogenous vari- 
ables fluid reasoning, working memory, and classroom climate, 
which we expected to be significantly related to CPS. Thereby, 
classroom climate was expected to explain differences across 
classes in CPS performance. 


Hypothesis 1: Fluid reasoning and working memory will 
explain variance in CPS skills on the individual level, whereas 
classroom climate will explain variance in CPS skills on the 
class level. 


Outcomes Associated With CPS Skills 


Besides identifying antecedents of CPS skills (Hypothesis 1), 
it is also important to understand how CPS skills are related to
outcomes. For instance, the OECD (2013) stressed that CPS
skills provide “a basis for future learning” (p. 7) and, thus, 
should be related to a number of educational and socioeconomic 
outcomes on the individual level (cf. Autor et al., 2003). In a 
similar vein, Baker and O’Neil (2002) argued that CPS skills 
amplify learning across the life span, thus rendering them highly
important for later success in life. Clearly, for any skill that is 
considered to be of crucial relevance in the real world, it is 
expected that a measurable impact of this skill on criteria with 
real-world implications can be shown even after controlling for 
other cognitive abilities. 

Empirically speaking, measures of many other skills have 
failed to fulfill these expectations, but the first findings on CPS 
skills have been encouraging. CPS skills have been found to be 
related to academic achievement to a substantial degree and to 
yield incremental explanatory power above and beyond cogni- 
tive abilities such as fluid reasoning (e.g., Greiff, Kretzschmar, 
Müller, Spinath, & Martin, 2014; Greiff et al., 2013; Sonnleitner
et al., 2013; Wüstenberg et al., 2012) and working memory
(Schweizer et al., 2013). That is, across studies, CPS has 
explained an additional 1% to 11% of the variance in school 
grades beyond fluid reasoning (e.g., Greiff, Kretzschmar et al., 
2014; Wüstenberg et al., 2012), and it has explained an additional
11% in science grades and 5% in social studies grades
beyond working memory (Schweizer et al., 2013). Thus, we 
expected that CPS would explain variance in school grades and 
academic potential on the individual level beyond fluid reason- 
ing and working memory. 
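
The idea of incremental explanatory power can be illustrated with a small calculation. The sketch below uses the standard formula for R² with two standardized predictors and subtracts the variance already explained by the first predictor. The three correlations are hypothetical placeholders chosen for illustration only; they are not estimates from this or any cited study.

```python
def r2_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """R^2 of a criterion regressed on two standardized predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


def incremental_r2(r_y1: float, r_y2: float, r_12: float) -> float:
    """Variance explained by predictor 2 beyond predictor 1 alone."""
    return r2_two_predictors(r_y1, r_y2, r_12) - r_y1**2


# Hypothetical values (assumptions, not study results):
# predictor 1 = fluid reasoning, predictor 2 = CPS, criterion = grades.
delta = incremental_r2(r_y1=0.50, r_y2=0.40, r_12=0.50)
print(f"Incremental R^2 of CPS beyond fluid reasoning: {delta:.3f}")
```

Under these assumed correlations, CPS would add about 3 percentage points of explained variance, which falls within the 1% to 11% range reported across studies.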

With regard to external factors, there are numerous studies 
that have shown that the way teachers act and how they support 
their students (sometimes also labeled classroom management; 
Brophy, 2006) influence classroom climate and are related to a 
number of outcomes such as students’ achievement and their 
motivation (e.g., Hamre & Pianta, 2005; Klem & Connell, 
2004; Sanders, Wright, & Horn, 1997). More recently, Fauth, 
Decristan, Rieser, Klieme, and Büttner (2014) analyzed longitudinal
data from more than 1,000 students in 54 classes and
showed that ratings of cognitive activation and supportive cli- 
mate predicted students’ development of subject-related inter- 
est, and classroom management predicted student achievement 
on a science test. More specifically, with a multilevel regression 
analysis, Fauth et al. (2014) showed that the class-level variables
classroom management (β = .37) and teacher popularity
(β = .17) explained 22% of the class-level variance in students’
science performance. Hence, we expected that classroom cli- 
mate would be related to the outcomes of school grades and 
academic potential. Further, if classroom climate was found to 
be related to CPS (see Hypothesis 1), we expected that there 
would be a significant indirect effect from classroom climate 
via CPS to the outcomes (i.e., school grades and academic 
potential). A good classroom climate should enable students to 
gather new knowledge, which is captured by CPS skills and is 
needed in any educational setting. 


Hypothesis 2: CPS skills will explain variance in academic
potential and school grades beyond fluid reasoning and working
memory on the individual level. On the class level, classroom
climate will be directly related to school grades and
academic potential and also indirectly related to these outcomes
via its influence on CPS.


Method 


Participants 


The initial sample included all 2,029 students attending the sixth 
grade in one Finnish municipality (984 male, 988 female, 57 
missing sex; age: M = 12.25, SD = 0.46).° Students were nested 
in 112 different classes. All students worked on CPS and fluid 
reasoning tasks and completed a questionnaire on demographic data
including school grades. Working memory was additionally as- 
sessed in a random subset of approximately 20% of the overall 
sample (N = 415; 191 male, 216 female, eight missing sex; age: 
M = 12.29, SD = 0.49). All teachers (N = 112) were asked to fill
out a questionnaire on classroom climate. In addition, teachers 
gave an assessment of the academic potential of each student. 
Testing took place in the schools’ computer rooms, and all tests 
were administered online except working memory, for which the 
items were read aloud to the students. 

Test application via the Internet led to some technical problems, 
thus resulting in lost data, particularly on the first days of testing. 
A total of N = 45 students who were missing information about 
which class they belonged to were excluded from the multilevel 
analyses. Another N = 314 students who had no data at all on CPS
(i.e., all data on each of the 18 CPS indicators were missing) and 
most other constructs due to technical problems during the online 
assessment were also excluded. Thus, data from 1,670 students 
were available for the analyses involving CPS, fluid reasoning, and 
school grades (821 male, 837 female, 12 missing sex; age: M = 
12.24, SD = 0.45). Data from 357 students were used for the 
analyses on working memory (N = 357; 156 male, 193 female, 
eight missing sex; age: M = 12.29, SD = 0.49). Students were 
nested in 100 different classes, thus providing sufficient data for 
multilevel analyses. Seventy-two teachers (72%) returned the 
questionnaire on classroom climate and the assessment of aca- 
demic potential. 
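
The sample sizes reported above can be cross-checked with simple arithmetic. The following minimal sketch uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts taken from the Participants section.
initial_sample = 2029        # all sixth graders in the municipality
missing_class_id = 45        # excluded: class membership unknown
no_cps_data = 314            # excluded: no CPS data due to technical problems

analysis_sample = initial_sample - missing_class_id - no_cps_data

classes_in_analysis = 100    # classes remaining after exclusions
teachers_responding = 72     # teachers who returned the questionnaire
teacher_response_rate = teachers_responding / classes_in_analysis

print(f"Students available for analysis: {analysis_sample}")
print(f"Teacher response rate: {teacher_response_rate:.0%}")
```

The result matches the 1,670 students and the 72% teacher response rate reported in the text.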


Materials and Scoring 


CPS. To measure students’ CPS skills, we employed a com- 
puterized assessment of performance on a number of independent 
complex problem tasks using the MicroDYN platform (Greiff et
al., 2013; Wüstenberg et al., 2012). In line with the conceptual
differentiation of CPS skills into knowledge acquisition and 
knowledge application (see above; Novick & Bassok, 2005), when 
they worked on the tasks, students first had to acquire knowledge 
about a new problem situation and subsequently apply this knowl- 
edge in order to reach a given goal. The CPS tasks represented 
complex problems as they confronted the students with intransparent
and dynamically changing situations. That is, not all information


° The data were drawn from a panel study. The sample has already been
used in other publications (Krkovic, Greiff, Kupiainen, Vainikainen, &
Hautamäki, 2014; Vainikainen, 2014; Wüstenberg, Stadler, Hautamäki, &
Greiff, 2014). Vainikainen (2014) and Krkovic et al. (2014) did not use
data on CPS at all. Wüstenberg et al. (2014) focused on strategic behavior
in CPS. The analyses that we applied to test Hypotheses 1 and 2 are unique.
about how to solve the problem was given at the outset
(i.e., intransparency; see Funke, 2010), and the problem situations 
changed as a result of the students’ actions or as a function of time 
(i.e., dynamics). Hence, no routine solution could be applied to 
master these problems (Funke, 2010). 

MicroDYN as a measure of CPS skills was found to yield good 
psychometric properties (Greiff et al., 2013). It has also shown 
sufficient convergent and divergent validity (Schweizer et al., 
2013; Wüstenberg et al., 2012). In the PISA 2012 survey, MicroDYN
tasks were used to assess CPS skills in 15-year-old students
in over 50 countries worldwide (OECD, 2009). 

To understand how students must display their CPS skills when 
working on MicroDYN tasks, consider the task “Handball train- 
ing” in Figure 1 as an illustration. Before they begin the task, 
students are asked to imagine that they are the new coach of a 
handball team and that they should find out how the intensity of
different training methods (input variables labeled Trainings A, B,
and C; left part of Figure 1a) influences important performance
indicators of their team (output variables labeled motivation,
power of the throw, exhaustion; right part of Figure 1a). In the first
phase of knowledge acquisition (Figure 1a), students are asked to
explore the relations between the different types of training and the 
performance indicators by changing the intensity of the training 
methods (i.e., by moving sliders and clicking the “Apply” button; 
left side of Figure 1a) and by observing the corresponding changes 
in the performance indicators (right side of Figure 1a). However, 
besides being affected by the direct effects of training intensity, the 
performance indicators may also change by themselves as time 
passes (i.e., simultaneously with clicks), which is a characteristic 
feature of complex problems (cf. Funke, 2001; Raven, 2000). For 
instance, motivation may gradually decrease over time without any 
intervention. While they evaluate the causal relations between the 
types of training and performance, the students are instructed to 
graphically depict their mental representation of the problem sit- 
uation in a causal diagram (cf. concept maps; bottom of Figure 1a). 
In the example, each type of training has an effect on at least one 
performance indicator. 
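
The structure of such a task can be sketched as a small linear system in which each output depends on its own previous value (the change over time) and on the inputs set by the student. The coefficients below are hypothetical and were chosen only to mirror the verbal description (each training affects at least one indicator, and motivation declines without intervention); the actual task equations are given in the Appendix of the article and may differ.

```python
# Hypothetical effects of each training on each performance indicator.
INPUT_EFFECTS = {
    "motivation": {"A": 2.0},
    "power_of_throw": {"B": 2.0},
    "exhaustion": {"C": 3.0},
}

# Hypothetical change over time: a factor below 1 means the indicator
# declines by itself, a factor of 1 means it stays constant.
SELF_EFFECTS = {"motivation": 0.9, "power_of_throw": 1.0, "exhaustion": 1.0}


def step(state: dict, inputs: dict) -> dict:
    """Advance the system by one round (one click on "Apply")."""
    new_state = {}
    for name, value in state.items():
        input_effect = sum(
            weight * inputs.get(training, 0.0)
            for training, weight in INPUT_EFFECTS[name].items()
        )
        new_state[name] = SELF_EFFECTS[name] * value + input_effect
    return new_state


start = {"motivation": 10.0, "power_of_throw": 10.0, "exhaustion": 10.0}

# Round with all sliders at zero: only the change over time is visible.
idle = step(start, {"A": 0.0, "B": 0.0, "C": 0.0})

# Round in which only Training A is varied, isolating its effect.
only_a = step(start, {"A": 1.0, "B": 0.0, "C": 0.0})

print(idle)
print(only_a)
```

Comparing the idle round with a round in which a single input is varied separates the effect of that training from the change that occurs over time, which is the kind of exploration the knowledge acquisition phase calls for.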







In the second phase of knowledge application (Figure 1b), on
the basis of the knowledge acquired in the previous phase and the 
correct model displayed at the bottom of the screen (cf. bottom of 
Figure 1b), students are asked to combine the different types of 
training (i.e., Trainings A, B, C; left side of Figure 1b) to try to 
reach a given level in the performance indicators (i.e., motivation, 
power of the throw, exhaustion; right side of Figure 1b). Red target 
areas and numbers mark the target values for the performance 
indicators (right side of Figure 1b). Due to the scheduling of the 
next match, the number of training units and, thus, the number of 
possible intervention steps is limited to four. 

Handball training is one of nine tasks that are given in Micro- 
DYN. Thus, after receiving detailed instructions and working on a 
trial task, students have several opportunities to display their CPS 
skills during a test session that lasts about 45 min. Specifically, 
besides coaching a handball team, students face tasks in which 
they have to find out the rules of a new board game, use fertilizers 
to grow vegetables in a garden, or repair and safely guide a 
stranded space shuttle home (Greiff et al., 2013). That is, CPS 
tasks differ in the contexts in which they are embedded and also in 
their difficulty, ranging from easy to difficult. All tasks are de- 
signed and labeled in such a way that the influence of prior 
knowledge is minimized, and performance relies on students’ CPS 
skills only. For example, students who play handball or even 
students who have coached a handball team in real life do not have 
an advantage over students with different backgrounds because 
input labels either do not have deep semantic meaning (e.g., 
Training A) or are fictitious (e.g., Solurax as the name of a 
fertilizer). 

For scoring purposes, indicators of knowledge acquisition and 
knowledge application for each of the nine tasks were derived in 
Phases 1 and 2, respectively. For knowledge acquisition, credit 
was given if a student’s causal diagram was entirely correct; 
otherwise, zero credit was assigned. Credit was given for knowl- 
edge application if the predefined goal values were achieved, 
whereas no credit was assigned if target values were not achieved, 
thus resulting in a total of 18 indicators (for details on scoring, see
Greiff et al., 2013). We used these manifest indicators to form one
higher order CPS skills factor (see statistical analyses). Linear
equations that specify the optimal strategies for performing well on
all tasks are given in the Appendix.

Figure 1. Screenshot of the MicroDYN task Handball training. The knowledge acquisition phase is displayed
on the left (Figure 1a). The knowledge application phase is displayed on the right (Figure 1b). See text for further
details. See the online article for the color version of this figure.
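
The dichotomous scoring rule described above can be sketched as follows. The representation of a causal diagram as a set of (input, output) pairs and of goal values as numeric ranges is an assumption made for this illustration; the function and variable names are hypothetical and do not come from the MicroDYN software.

```python
def score_task(drawn_edges, true_edges, final_values, target_ranges):
    """Return (knowledge acquisition, knowledge application) credit, each 0 or 1."""
    # Acquisition: credit only if the causal diagram is entirely correct.
    acquisition = int(set(drawn_edges) == set(true_edges))
    # Application: credit only if every target value is reached.
    application = int(
        all(
            low <= final_values[name] <= high
            for name, (low, high) in target_ranges.items()
        )
    )
    return acquisition, application


# Hypothetical task structure and student responses.
true_edges = {
    ("Training A", "motivation"),
    ("Training B", "power of the throw"),
    ("Training C", "exhaustion"),
}
drawn_edges = {
    ("Training A", "motivation"),
    ("Training B", "power of the throw"),
}  # one relation is missing, so no acquisition credit

targets = {"motivation": (18, 22), "power of the throw": (14, 16)}
final_values = {"motivation": 20, "power of the throw": 15}

scores = score_task(drawn_edges, true_edges, final_values, targets)

# Nine tasks with two indicators each yield the 18 manifest indicators.
n_indicators = 9 * 2
print(scores, n_indicators)
```

Because partial credit is not awarded, a diagram with a single missing or superfluous relation is scored the same as an empty diagram.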

Fluid reasoning. Fluid reasoning as a stratum II ability in the 
CHC theory was assessed with two separate reasoning tasks. One 
of the tasks measured Piagetian reasoning and the other deductive 
reasoning. Both tasks adhered to the more narrow level of stratum 
I in the CHC theory and were combined to form an overarching 
fluid reasoning factor. The Piagetian reasoning task was a five- 
item version of Bond’s (1995) Logical Operations Test. It opera- 
tionalizes each of the schemas of the formal operational stage 
identified by Inhelder and Piaget (1958). Each multiple-choice 
item is comprised of two to four short sentences followed by a set 
of alternative responses (e.g., “A prospector has found that some 
rich metals are sometimes found together. In his life he has 
sometimes found gold and silver together; sometimes he has found 
silver by itself; every other time he has found neither silver nor 
gold. Which of the following rules has been true for the prospec- 
tor? Gold and silver are found together, never apart; If he found 
silver then he found gold with it; If he found gold then he found 
silver with it; If he found gold then he didn’t find silver”). All five
items are coded as right or wrong. Bond’s Logical Operations Test 
has been used extensively in Australia but also in several studies 
conducted in other English-speaking countries (Endler & Bond, 
2006). 

The deductive reasoning task was an adaptation of the Ross Test 
of Higher Cognitive Processes (Ross & Ross, 1976). In each of the 
items, students are presented a fact (i.e., a premise) and a conclu- 
sion (e.g., conclusion: “Lake Saimaa is too cold for swimming;” 
first fact: “The temperature of Lake Saimaa is 5°C”). Students are
then asked to choose the second fact (premise) from among several 
alternatives to make the conclusion valid as a whole (e.g., “Most 
lakes are too cold for swimming;” “It is wintertime;” “Five degree 
water is too cold for swimming;” “Lake Saimaa is always cold;” 
“Swimming in cold water is no fun”). There were five items, each 
coded as right or wrong. The test has been widely used to assess 
higher order thinking in students (of the same age as the partici- 
pants in the present study) both internationally (e.g., Hopson, 
Simms, & Knezek, 2001; Paul & Nosich, 1993) and in Finland. 
Since 1996, it has been used in Finland to examine the cross- 
curricular outcomes of education as a part of the national model for 
educational evaluation (see Hautamäki, Hautamäki, & Kupiainen,
2010). 

Working memory. Working memory as a stratum II ability 
was measured with a test of arithmetical working memory on the 
more narrow level of stratum I. This test is an adaptation of the 
mental arithmetic subtest of the Wechsler Adult Intelligence Scale 
(Wechsler, 1939) and has been validated extensively in Finnish 
large-scale educational assessment studies (for the most recent 
description of the scales, see Vainikainen, Hautamäki, Hotulainen,
& Kupiainen, 2015). Teachers read aloud eight arithmetical prob- 
lems one at a time (e.g., “If you buy two bus tickets and one ticket 
costs 3 euros 50 cents, how much money do you get back if you 
give 10 euros?”), and students wrote down their answers within 
strict predefined time limits. The tasks target students’ ability to 
concentrate and keep information available in working memory 
while manipulating mental mathematical problems. Based on the 
framework by Oberauer, Süß, Schulze, Wilhelm, and Wittmann
(2000), each task represents working memory in the functional
factor storage and transformation of information in the numerical 
content category. According to Lynn and Irwing’s (2008) meta- 
analysis, it is primarily regarded as a working memory task. The 
eight items were scored as right versus wrong and were combined 
to form an overall working memory factor. 

Classroom climate. For the measure of classroom climate, 
teachers were asked to complete a questionnaire that assessed the 
overall classroom climate with a focus on the theoretical dimen- 
sion of task orientation. This questionnaire has been used across a 
period of several years in national and municipal Finnish large- 
scale studies aimed at describing between-class differences in 
classroom climate (Marjanen, Vainikainen, Kupiainen, Hautamäki,
& Hotulainen, 2014). The questionnaire consists of five
items that were scored on a 7-point Likert scale ranging from 
strong disagreement to strong agreement (i.e., “Students in my 
class follow the common rules of the class well;” “The social 
atmosphere in my class allows the students to try hard and suc- 
ceed;” “Working independently during the lessons works very well 
in my class;” “Group work during the lessons works very well in 
my class;” “It is quiet in my class so that everybody can concen- 
trate on his/her work”). Teacher reports were used instead of 
student questionnaires because teacher reports have been shown to 
be more sensitive to capturing classroom-level factors (Mitchell, 
Bradshaw, & Leaf, 2010). 

School grades and assessment of academic potential. We 
obtained students’ grades in two main subjects (mathematics and 
chemistry). Grades followed the usual scale used in Finnish 
schools ranging from 4 (insufficient) to 10 (excellent) and were 
used as indicators of a latent school-grade factor. As an assessment 
of academic potential, teachers evaluated students’ chances of 
success in comprehensive school (“I believe that she/he will do 
fine until the end of comprehensive school”) on a 7-point Likert 
scale ranging from strong disagreement to strong agreement. 


Procedure 


Testing was divided into two sessions. In the first session, 
students provided demographic data, including school grades, and 
worked on a test battery comprised of fluid reasoning, working 
memory (if applicable), as well as other cognitive tests and attitude 
scales (e.g., a questionnaire including attitudes toward learning and
achievement) that were not used in this paper (90 min overall). At 
the same time, teachers filled out the questionnaire that included 
the questions on classroom climate and an assessment of students’ 
academic potential. In the second session, which was conducted 
approximately 1 week after the first session, students worked on 
the CPS tasks (45 min). 


Data Analysis Plan 


First, we screened our sample (N = 1,670) for missing data. The 
percentages of missing data were 0.1% to 29.6% for the 18 
indicators of CPS, 1.6% to 3.7% for the 10 indicators of fluid 
reasoning, 79.5% to 82.9% for the eight indicators of working memory
(because working memory was administered to approximately
only 20% of the sample by design), 3.4% for math grade, and 5.5% 
for chemistry grade. 
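
The high percentage of missing working memory data follows largely from the design. As a rough check, the sketch below computes the share of the analysis sample that was not in the working memory subsample and converts the reported missing-data percentages back into approximate numbers of observed responses. The helper name is illustrative, and the conversion assumes that the percentages refer to the full analysis sample of 1,670.

```python
N_TOTAL = 1670   # analysis sample
N_WM = 357       # working memory subsample

# Share of students without working memory data by design.
missing_by_design = 1 - N_WM / N_TOTAL


def implied_observed(pct_missing: float, n: int = N_TOTAL) -> int:
    """Approximate number of observed responses for a given % missing."""
    return round(n * (1 - pct_missing / 100))


print(f"Missing by design: {missing_by_design:.1%}")
print("Observed responses per item (approx.):",
      implied_observed(82.9), "to", implied_observed(79.5))
```

Under this assumption, roughly 286 to 342 of the 357 students in the subsample responded to any given working memory item, so only a small part of the missingness is due to item nonresponse.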

We used Mplus Version 7 (Muthén & Muthén, 2010) to replace 
the missing data by means of multiple imputation. Multiple imputation
replaces each missing value with a set of plausible values,
thereby generating multiple data sets that are used in subsequent 
analyses (see Schafer, 1999). There is consensus among research- 
ers that multiple imputation is superior to traditional methods such 
as listwise or pairwise deletion (Manly & Wells, 2014), with which 
a considerable amount of valuable information would have been 
lost. 

A prerequisite for conducting multiple imputation is that the 
data are missing completely at random (MCAR) or at least missing 
at random (MAR). We used SPSS to conduct Little’s MCAR test, 
which revealed that the data were missing completely at random if 
we included all variables used in our models (χ² = 17011.386,
df = 17755, p = .999). Because we wanted to ensure that the 
working memory data, which were available for only a subsample 
of N = 357, were not missing systematically, we decided to test an 
additional model with only fluid reasoning and working memory. 
We chose fluid reasoning in addition to working memory because 
it had only a little missing data and was highly correlated with 
working memory, thus reducing noise and increasing the proba- 
bility of detecting that the data were MCAR. The data were also 
MCAR in the analysis with only fluid reasoning and working 
memory (χ² = 1614.223, df = 1531, p = .068), implying that the
working memory subsample did not differ significantly from the 
full sample. Next, we imputed missing data for all constructs 
except working memory. We refrained from imputing missing data 
on working memory because working memory was administered 
to only 20% of the sample, and imputing data with more than 50% 
missing values is not advisable (see Manly & Wells, 2014). How- 
ever, we included working memory as an auxiliary variable in 
addition to all of the cognitive measures and scales (i.e., CPS, fluid 
reasoning, school grades, academic potential, classroom climate) 
in the imputation model. That is, the missing data on the working 
memory indicators were not imputed but were used as correlates to 
provide better estimates of the missing data in the estimation 
process (see Muthén & Muthén, 2010). 
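
The p values of the two MCAR tests can be approximated from the reported χ² and df values. The sketch below uses the Wilson-Hilferty normal approximation to the chi-square distribution, which is accurate for large df; it is an illustrative substitute for the exact computation performed by statistical software, not the procedure used in the study.

```python
import math


def chi2_upper_tail(chi2: float, df: int) -> float:
    """Approximate P(X >= chi2) for a chi-square variable with df degrees
    of freedom, using the Wilson-Hilferty cube-root transformation."""
    spread = 2 / (9 * df)
    z = ((chi2 / df) ** (1 / 3) - (1 - spread)) / math.sqrt(spread)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Values reported in the text for Little's MCAR test.
p_all_variables = chi2_upper_tail(17011.386, 17755)
p_gf_and_wm = chi2_upper_tail(1614.223, 1531)

print(f"All variables:              p = {p_all_variables:.3f}")
print(f"Fluid reasoning and memory: p = {p_gf_and_wm:.3f}")
```

Both approximations agree with the reported results: the first p value is close to 1, and the second is approximately .068, so neither test rejects the MCAR assumption at the .05 level.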

The imputation model distinguished between the within-level 
(i.e., individual-level) and between-level (i.e., class-level) vari- 
ables and therefore included a multilevel structure as recom- 
mended by Muthén and Muthén (2010). We generated five im- 
puted data sets using Bayesian estimation with H1 imputation, 
which is one of the standard procedures in Mplus 7 for this kind of
analysis. Mplus 7 provided pooled parameter estimates across the 
five data sets for all subsequent analyses.

The subsequent analyses that were based on the five imputed
data sets used multilevel structural equation modeling (SEM; 
Bollen, 1989), which took the nested structure of our data into 
account, as students were nested in classes. That is, in the analyses 
for both Hypotheses 1 and 2, students’ data were modeled on the 
individual level, whereas teachers’ data (i.e., classroom climate) 
and random intercepts of constructs measured at the individual 
level were modeled on the class level. 
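
Pooling parameter estimates across imputed data sets is commonly done with Rubin's rules: the pooled estimate is the mean of the per-data-set estimates, and the pooled variance combines the average within-imputation variance with the between-imputation variance. The sketch below illustrates this for a single parameter. The five estimates and standard errors are hypothetical, and the code is a generic illustration rather than a reproduction of the Mplus routine.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets (Rubin's rules).

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean([se**2 for se in std_errors])
    between = statistics.variance(estimates)  # sample variance, m - 1
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total_variance)


# Hypothetical path coefficients from five imputed data sets.
estimates = [0.30, 0.32, 0.28, 0.31, 0.29]
std_errors = [0.05, 0.05, 0.05, 0.05, 0.05]

pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(f"Pooled estimate = {pooled_est:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than any single standard error because it also reflects the uncertainty introduced by the imputation itself.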

We applied weighted least squares means and variance adjusted 
(WLSMV) estimation because our items were categorical. The 
WLSMV estimator uses all available information and applies a 
pairwise present approach so that the models were based on the 
full sample (N = 1,670), but all parameters that involved working
memory were based on the reduced sample of N = 357. We report 
standard model fit indices such as the comparative fit index (CFI)
and the root mean square error of approximation (RMSEA), en- 
dorsing the cut-off values recommended by Hu and Bentler (1999) 
for excellent fit: RMSEA (<.06) and CFI (>.95). We applied 
one-tailed tests of significance for the directional Hypotheses 1 
and 2. 


Results 


Descriptive Statistics, Measurement Models, and 
Intraclass Correlations 


The descriptive statistics, including means, standard deviations, 
and manifest correlation coefficients, are shown in Table 1. All 
descriptive statistics were based on raw data without imputation. 
As expected, all cognitive constructs were significantly related to 
each other. The only nonsignificant correlation was obtained for 
classroom climate and fluid reasoning, r = .055, p = .06. On a 
descriptive level, the results seemed to be consistent with our 
hypotheses and provided a good starting point for more detailed 
analyses. 

For all subsequent analyses, latent SEM models based on im- 
puted data were applied. Measurement models for each construct 
with more than three items are displayed in Table 2. Each latent 
factor was scaled by restricting the factor loading of the first item 
to 1. CPS was modeled as a two-dimensional construct including 
knowledge acquisition and knowledge application, whereas all 
other constructs were modeled as one-dimensional. All measure- 


Means, Standard Deviations, and Correlations Between Constructs 


a 





Number of 
Construct items M SD Scale 1 2 3 ~ 5 6 di N 
1. Knowledge acquisition (CPS) 9 il PAD, Oxon — 1,670 
2. Knowledge application (CPS) 9 wll Bu eOito 1 .470™* — 1,573 
3. Fluid reasoning 10 48 ie OKto 1 B4ies03/ _ 1,661 
4. Working memory 8 45 28 Otol i325 E376 a) 425 — 357 
5. Classroom climate* 5 S,06p 05a emduto: 7 Oe SIS OSE) sO" — 222) 
6. School grades 2 Too) Cpa 10 3207." 520. ADA. AST 072" ry 1,623 
7. Assessment of academic potential 1 514 1.36“ 1to7 DADS E2307 3308 Ue 364 F126 549 — 1,594 


Note. Manifest means and SDs were calculated across the respective number of items in each scale using the original data without imputation. Bivariate
correlations were computed on manifest means of constructs without imputation.
a Students in the same grade were assigned the same value on classroom climate to compute the correlations.
* p < .05. ** p < .01.
Table 2
Goodness of Fit Indices for CFA Measurement Models 


WUSTENBERG, GREIFF, VAINIKAINEN, AND MURPHY 







Model               Number of items   χ²        df    p       CFI    RMSEA   N
CPS*                18                334.045   134   <.001   .972   .030    1,670
Fluid reasoning     10                59.262    35    .006    .976   .020    1,670
Working memory      8                 21.009    20    .400    .998   .012    357
Classroom climate   5                 10.209    5     .070    .955   .029    Ta


Note. df = degrees of freedom; CFI = Comparative Fit Index; RMSEA = Root Mean Square Error of Approximation. Model parameters were averaged
over the imputations; χ² and df for classroom climate were estimated with the Maximum Likelihood Robust (MLR) estimator; χ² and df for all other
variables containing categorical indicators were estimated by WLSMV.
* For CPS, a two-dimensional model with knowledge acquisition (nine items) and knowledge application (nine items) was applied. The two dimensions
were significantly correlated (r = .80, p < .01). All other models were unidimensional.


ment models showed excellent fit according to the CFI and RM- 
SEA values, and all items loaded statistically significant on the 
corresponding factors (CPS: β = .184 to .872; fluid reasoning:
β = .237 to .693; working memory: β = .480 to .735; classroom
climate: β = .460 to .712). The measurement models based on the
collected data without imputation showed comparable results in 
terms of the path coefficients and did not differ in terms of the fit 
indices from the results obtained with the imputed data. 

An anonymous reviewer requested that we test alternative measurement
models for fluid reasoning and working memory as well
as the two CPS dimensions. As the difference test procedure in
Mplus that is necessary to compare models with the WLSMV
estimator is not available for imputed data sets (see Muthén &
Muthén, 2010), we compared alternative models using the nonimputed
raw data. A two-dimensional model with fluid reasoning and
working memory (χ² = 201.521, df = 134, p < .001; CFI = .958;
RMSEA = .017) revealed a significantly better fit than a
one-dimensional model (χ² = 226.574, df = 135, p < .001; CFI =
.943; RMSEA = .020) in which the fluid reasoning and working
memory items were combined under one general factor
(χ²-difference test in Mplus: χ² = 12.297, df = 1, p < .001). A
two-dimensional model of CPS with knowledge acquisition and
knowledge application (χ² = 316.299, df = 134, p < .001; CFI =
.977; RMSEA = .029) revealed a significantly better fit than a
one-dimensional model (χ² = 451.787, df = 135, p < .001; CFI =
.961; RMSEA = .037; χ²-difference test: χ² = 94.468, df = 1, p <
.001). Please note that the χ² values of the two models could not
be directly subtracted to compare them because computing the
differences of two χ² values and dfs between models is not appropriate
when WLSMV estimation is applied (Muthén & Muthén,
2010).
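
The RMSEA values reported for these model comparisons can be approximately reproduced from the χ² statistic, its degrees of freedom, and the sample size. The following is a minimal Python sketch, assuming the common single-group formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))) and assuming that N = 1,670 applies to the nonimputed comparisons. The function name `rmsea` is hypothetical, and the exact computation that Mplus performs under WLSMV estimation may differ in detail.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a chi-square statistic.

    Assumes the common single-group formula; this is an illustration,
    not a reimplementation of the Mplus WLSMV computation.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Chi-square values and dfs as reported in the text; N = 1,670 is assumed.
two_factor_gf_wm = rmsea(201.521, 134, 1670)  # fluid reasoning + working memory
two_factor_cps = rmsea(316.299, 134, 1670)    # knowledge acquisition + application

print(f"{two_factor_gf_wm:.3f}")
print(f"{two_factor_cps:.3f}")
```

Under these assumptions, the two calls return values that round to .017 and .029, which agrees with the RMSEAs reported for the two-dimensional models.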

Internal consistencies were calculated using McDonald’s (1999)
ω coefficient with $\omega = (\sum_i \lambda_i)^2 / [(\sum_i \lambda_i)^2 + \sum_i \delta_{ii}]$, where $\lambda_i$
were the factor loadings and $\delta_{ii}$ the residual variances. Factor
loadings and residual variances were based on standardized estimates
using the MLR estimator. The results showed acceptable to
excellent reliabilities for all constructs (i.e., knowledge acquisition:
ω = .91; knowledge application: ω = .80; fluid reasoning:
ω = .70; working memory: ω = .83; classroom climate: ω = .83).
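
As an illustration of how the ω coefficient is obtained from a one-factor solution, the following Python sketch applies the formula above to standardized loadings. It assumes uncorrelated residuals, so that each standardized residual variance equals one minus the squared loading. The function name `mcdonald_omega` and the loadings are hypothetical and are not taken from the study.

```python
def mcdonald_omega(loadings):
    """McDonald's omega from standardized loadings of a one-factor model."""
    squared_sum = sum(loadings) ** 2
    # Assumption: standardized solution with uncorrelated residuals, so each
    # residual variance is 1 - loading**2.
    residual_sum = sum(1.0 - loading ** 2 for loading in loadings)
    return squared_sum / (squared_sum + residual_sum)


# Hypothetical loadings for illustration only (not estimates from the study).
example_loadings = [0.70, 0.70, 0.70, 0.70]
omega = mcdonald_omega(example_loadings)
print(round(omega, 3))
```

With four loadings of .70, the squared sum of loadings is 7.84 and the summed residual variances are 2.04, so ω is about .79.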

Before testing the hypotheses, we followed the recommended 
procedure for multilevel SEM and created an unconditional (null) 
model, in which all constructs were correlated without specifying 
any predictions, to identify item-level intraclass correlations 
(ICCs). ICCs are estimates of the class-level variance divided by 
the total variance (i.e., within-level plus between-level variance;
see Dyer, Hanges, & Hall, 2005). The ICC was .192 for academic
potential, and the ICCs ranged from .008 to .128 for CPS, from 
.001 to .068 for fluid reasoning, from .005 to .254 for working 
memory, and from .071 to .093 for school grades, indicating that 
the majority of the variance was found at the individual level and 
not at the class level. However, there was also sufficient variance 
across levels to warrant the application of multilevel methods (see 
Tabachnick & Fidell, 2007). 
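
The ICC definition given above (class-level variance divided by total variance) can be written as a short Python sketch. The function name `intraclass_correlation` and the variance components are hypothetical; they were chosen only so that the result equals the ICC of .192 reported for academic potential.

```python
def intraclass_correlation(between_var: float, within_var: float) -> float:
    """Share of the total variance that lies between classes."""
    return between_var / (between_var + within_var)


# Hypothetical variance components (not reported in the study).
icc = intraclass_correlation(between_var=0.192, within_var=0.808)
print(round(icc, 3))
```

An ICC near zero means that nearly all variance lies between students within classes, whereas larger values indicate that class membership accounts for a larger share of the variance.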


Antecedents of CPS Skills 


For Hypothesis 1, we expected that fluid reasoning and working 
memory would be significantly related to CPS skills on the indi- 
vidual level, whereas classroom climate was expected to explain 
variance in CPS skills between classes. For this analysis, we used 
a random intercept multilevel model that took the nested structure 
of the data into account (see Muthén & Muthén, 2010). That is, 
fluid reasoning and working memory served as independent vari- 
ables, and the second-order CPS factor served as the dependent 
variable on the within level. On the class level, classroom climate 
was used as the independent variable, and the intercept of the CPS 
factor was allowed to vary randomly across classes. In the model, 
the random intercept of the second-order CPS factor was regressed 
on classroom climate. Note that CPS was modeled as a first-order 
factor on the class level (i.e., the knowledge acquisition and 
knowledge application items loaded on one factor, cf. Figure 2) 
and not as a second-order factor. Research using multilevel approaches
has frequently shown that the number of factors of a
latent construct decreases at higher levels (Muthén & Muthén,
2010), and this was also the case in this study (not reported).

The multilevel model displayed in Figure 2 showed an excellent
fit (χ² = 932.375, df = 818, p = .003; CFI = .969, RMSEA =
.009). Results on the within level revealed that fluid reasoning
(β = .409, SE = .143, p = .002) and working memory (β = .314,
SE = .171, p = .033) were significantly related to CPS and
that both constructs explained a substantial amount of variance in
CPS (R² = .462; cf. Figure 2). On the class level, classroom
climate was related to performance in CPS skills as expected (β =
.473, SE = .122, p < .001; between-level R² = .225), showing that variance
in CPS performance between classes was explained by classroom
climate.

In summary, fluid reasoning and working memory were related 
to CPS, and a better classroom climate yielded a higher CPS 
performance, supporting Hypothesis 1. 
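
The one-tailed p values reported for the within-level paths are consistent with a simple Wald test, in which the estimate divided by its standard error is referred to a standard normal distribution. The following Python sketch illustrates this under that assumption; the function name `one_tailed_p` is hypothetical, and the software used in the study may compute p values differently in detail.

```python
import math


def one_tailed_p(estimate: float, standard_error: float) -> float:
    """Upper-tail p value of a Wald z test (normal reference distribution)."""
    z = estimate / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Path coefficients and standard errors as reported for the within level.
p_fluid_reasoning = one_tailed_p(0.409, 0.143)
p_working_memory = one_tailed_p(0.314, 0.171)

print(round(p_fluid_reasoning, 3))
print(round(p_working_memory, 3))
```

Under this assumption, the two calls give values that round to .002 and .033, matching the reported p values.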
Figure 2. N = 1,670. Multilevel regression model in which fluid reasoning and working memory explain
variance in CPS on the within level and classroom climate explains variance in CPS on the class level. In the
within part of the model, the darkened circles at the ends of the arrows pointing toward the CPS items represent
random intercepts that are referred to in the between part of the model. In the between part of the model, the
random intercepts are shown in circles because they are continuous latent variables that vary across clusters (i.e.,
classes). The random intercepts are indicators of the between part of CPS. Path coefficients were standardized,
and one-tailed p values were applied. Only the path coefficients that included working memory were based on
N = 357. * p < .05. ** p < .01.


Outcomes Associated With CPS Skills 


For Hypothesis 2, we expected that CPS skills would exhibit 
incremental validity in explaining variance in both school grades 
and academic potential above and beyond fluid reasoning and
working memory on the individual level. On the class level, we
expected that classroom climate would be directly related to school
grades and academic potential and indirectly related to school 
grades and academic potential via its influence on CPS. 

All subsequently reported models were random intercept mul- 
tilevel models that took the nested structure of the data into 
account. In Model A, fluid reasoning, working memory, and CPS 
were modeled as exogenous variables that explained variance in 
school grades and academic potential on the individual level. On 
the class level, the exogenous variable classroom climate was 
directly related to the random intercepts of school grades and 
academic potential and indirectly related to them via classroom 
climate’s relation to CPS (cf. Hypothesis 1). Random intercepts 
displayed between-class differences in CPS, school grades, and 
academic potential. All intercepts were allowed to vary randomly 
across classes.

The overall fit of Model A was excellent (χ² = 1132.170, df =
1011, p < .001; CFI = .974, RMSEA = .008). On the within level,
variance in school grades was explained by fluid reasoning (β =
.389, SE = .114, p < .001), working memory (β = .305, SE =
.152, p = .022), and CPS (β = .145, SE = .073, p = .023). Fluid
reasoning was significantly related to academic potential (β =
.213, SE = .117, p = .034), but working memory (β = .196, SE =
.146, p = .088) and CPS (β = .097, SE = .068, p = .075) were
not.

On the class level, classroom climate was related to class-level
differences in CPS modeled as the random intercept (β = .476,
SE = .122, p < .001). These differences, in turn, were significantly
related to school grades (β = .412, SE = .249, p = .048) but
not to academic potential (β = .180, SE = .177, p = .155).
Further, classroom climate was not directly related to school
grades (β = .207, SE = .265, p = .218), but it was related to
academic potential (β = .329, SE = .160, p = .020).

To estimate the incremental value of CPS on the within level, 
we descriptively compared the R² values of Model A, including
three exogenous variables (i.e., fluid reasoning, working memory, 
CPS), with Model B, including two exogenous variables (only 
fluid reasoning and working memory). In Model A, the three 
exogenous variables explained 57.6% of the variance in school 
grades and 21.0% of the variance in academic potential. In Model 
B, 56.3% of the variance in school grades and 20.4% of the 
variance in academic potential were explained by fluid reasoning 
and working memory (overall model fit: χ² = 273.157, df = 210,
p < .001; CFI = .976, RMSEA = .013). Thus, when comparing 
Models A and B, CPS accounted for an additional 1.3% of the 
variance in school grades and showed no incremental value in 
explaining academic potential (i.e., nonsignificant paths). 
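
The descriptive comparison of Models A and B amounts to subtracting the explained variance of the smaller model from that of the larger model. A minimal Python sketch of this arithmetic, using the percentages reported above, is given here; the variable names are hypothetical.

```python
# Explained variance (R squared) as reported for the two models.
r2_model_a = {"school grades": 0.576, "academic potential": 0.210}
r2_model_b = {"school grades": 0.563, "academic potential": 0.204}

# Incremental variance attributable to adding CPS (Model A minus Model B).
delta_r2 = {
    outcome: round(r2_model_a[outcome] - r2_model_b[outcome], 3)
    for outcome in r2_model_a
}

for outcome, gain in delta_r2.items():
    print(f"{outcome}: {gain:.3f}")
```

The difference for school grades corresponds to the additional 1.3% of variance reported in the text. This comparison is purely descriptive and does not by itself test whether the gain is statistically significant.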

However, the incremental value of CPS could also be shown by
using an alternative model (i.e., Model C, see Figure 3) that had
the advantage of directly testing the statistical significance of the
unique effect of CPS. In Model C, on the individual level, fluid
reasoning and working memory were modeled to explain variance
in CPS. The residual variance in CPS was allocated to the CPS
residual that was not explained by and, thus, was fully independent
of fluid reasoning and working memory (i.e., CPS_res in Figure 3).
Further, CPS_res was modeled to explain variance in school grades
and academic potential beyond the variance in these outcomes that
was explained by fluid reasoning and working memory. CPS_res
was by definition uncorrelated with fluid reasoning and working
memory, thereby providing the best way to illustrate the incremental
validity of CPS beyond fluid reasoning and working memory.
Identical to its relations in Model A, the exogenous variable 
Figure 3. N = 1,670. Random intercept multilevel regression model in which school grades and academic
potential were regressed on fluid reasoning, working memory, and the residual of CPS (CPS_res) on the within
level. On the class level, classroom climate was modeled to have a direct effect on school grades and academic
potential as well as an indirect effect on school grades and academic potential via classroom climate’s influence
on CPS. CPS_res captured the part of variance in CPS not explained by fluid reasoning and working memory. In
the within part of the model, the darkened circles at the ends of the arrows pointing toward the items math grade,
chemistry grade, and academic potential represent random intercepts that are referred to in the class-level part
of the model. In the class-level part of the model, the random intercepts are shown in circles (e.g., school grades)
because they are continuous latent variables that vary across clusters (i.e., classes). Path coefficients were
standardized, and one-tailed p values were applied. Path coefficients with p > .05 are depicted as dotted lines.
Path coefficients with p < .05 are depicted as solid lines. Only the path coefficients that included working
memory were based on N = 357. * p < .05. ** p < .01.


classroom climate was directly related to the random intercepts of 
school grades and academic potential and indirectly related via its 
relation to CPS (cf. Hypothesis 1). 

As Model C differed from Model A only with regard to how the
CPS variance was divided, the overall fit of Model C was identical
to the fit of Model A (χ² = 1132.170, df = 1011, p < .001; CFI =
.974, RMSEA = .008). On the within level, the variance in CPS
was explained by fluid reasoning (β = .409, SE = .142, p = .002)
and working memory (β = .315, SE = .169, p = .032; cf. Figure
3). Overall, the incremental validity results were comparable to the
results obtained for Model A, but the path coefficients for CPS
when explaining variance in school grades (β = .106, SE = .056,
p = .029; ΔR² = .01) and academic potential (β = .071, SE =
.051, p = .079) were slightly lower than in Model A (i.e., CPS on
school grades: β = .145, SE = .073, p = .023; CPS on academic
potential: β = .097, SE = .068, p = .075). In this way, Model C
offered a conservative approach for testing the incremental validity
of CPS because the variance common to CPS, fluid reasoning, and
working memory was not captured by CPS_res but was fully attributed
to fluid reasoning and working memory, respectively. Results
on the class level were identical to Model A (see Figure 3).
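
One way to see why the standardized paths of the CPS residual are smaller than the CPS paths in Model A is the following Python sketch. It assumes that the algebra of ordinary regression carries over to the standardized within-level paths: replacing a predictor by its residual leaves the unstandardized path unchanged but shrinks the predictor's standard deviation by sqrt(1 − R²), where R² is the share of CPS variance explained by fluid reasoning and working memory. The function name `residualized_beta` is hypothetical, and this is an illustration rather than the estimation procedure used in the study.

```python
import math


def residualized_beta(beta_full: float, r2_predictor: float) -> float:
    """Standardized path of a residualized predictor.

    beta_full: standardized path of the original predictor (Model A).
    r2_predictor: variance of that predictor explained by the other
    predictors (here, CPS explained by fluid reasoning and working memory).
    """
    return beta_full * math.sqrt(1.0 - r2_predictor)


r2_cps = 0.462  # reported share of CPS variance explained on the within level

beta_grades = residualized_beta(0.145, r2_cps)     # CPS on school grades
beta_potential = residualized_beta(0.097, r2_cps)  # CPS on academic potential

print(round(beta_grades, 3))
print(round(beta_potential, 3))
```

Under these assumptions, the sketch yields values that round to .106 and .071, which agree with the Model C coefficients reported above.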

In summary, both the two-step procedure (i.e., comparing Models
A and B) and the modeling of CPS_res (i.e., Model C) showed
on the individual level that CPS incrementally explained a small
amount of variance in school grades but was not related to aca- 
demic potential beyond fluid reasoning and working memory. On 
the class level, classroom climate had a direct effect on academic 
potential and an indirect effect on school grades via between-class 
differences in CPS. Thus, Hypothesis 2 was supported only in part. 


Discussion 


In a search for a deeper understanding of CPS, we asked about 
the antecedents and outcomes of CPS skills. Using a large sample 
of Finnish sixth-grade students, we first determined how two 
cognitive abilities, fluid reasoning and working memory, as well as 
classroom climate were related to CPS skills. Fluid reasoning and
working memory explained CPS skills. In addition, classroom 
climate explained variance in CPS skills between classes, thus 
legitimating the assumption that both cognitive abilities and ex- 
ternal factors play important roles as CPS skills evolve. 

The implications of CPS skills for external outcomes were 
indicated by substantial correlations between CPS skills and 
school grades and an assessment of academic potential (cf. Table 
1). However, the results for incremental validity were not as 
expected. CPS showed no incremental value in explaining variance 
in academic potential, and, although CPS added significant vari- 
ance to the prediction of school grades, the effect was small. That 
is, the results suggested that to the extent that CPS was related to 
academic outcomes on a bivariate level, this was largely a conse- 
quence of its overlap with measures of cognitive ability. 


Antecedents of CPS Skills 


In Hypothesis 1, we expected that fluid reasoning and working 
memory would explain variance in CPS skills on the individual 
level and that classroom climate would explain variance between 
classes in CPS skills. Our results supported Hypothesis 1. More 
specifically, 46% of the variance in CPS skills was explained by 
both fluid reasoning and working memory on the individual level. 
On the class level, 23% of the variance was explained by class- 
room climate. 

Not unexpectedly, fluid reasoning as the stratum II ability to 
perform general mental operations correctly (Carroll, 1993) was 
significantly related to CPS. According to McGrew (2009), fluid 
reasoning includes the mental processes that are needed to solve 
basic novel problems that cannot be performed automatically, and 
Sternberg and Berg (1986) suggested that basic problem solving is 
part of almost any definition of fluid reasoning. 

Working memory as the stratum II ability to keep and process 
new information about a complex problem (Wirth & Klieme, 
2003) was also related to CPS. This result affirms but also extends 
findings by Schweizer et al. (2013) and Bühner et al. (2008).
Schweizer et al. (2013) reported a substantial correlation between
CPS skills and working memory but did not include measures of
fluid reasoning. Bühner et al. (2008) stressed the importance of
working memory in CPS skills even after controlling for visual 
processing as another stratum II ability. However, visual process- 
ing is related to a rather narrow aspect of cognitive ability. It is 
defined as “the ability to generate, store, retrieve, and transform 
visual images and sensations” (McGrew, 2009, p. 5). In the current 
study, we showed for the first time that working memory is still 
related to CPS even beyond fluid reasoning, which is presumably 
one of the most central abilities for human cognitive performance 
(McGrew, 2009). 

In addition to the effect of cognitive abilities, the external factor
classroom climate had a significant impact on CPS skills on the
class level, explaining a substantial amount of variance (i.e., 23%). 
Classroom climate is considered a major factor in classroom 
learning (Djigic & Stojiljkovic, 2011). In fact, the way teachers 
manage the classroom and shape a positive or negative climate is 
among the most important predictors of student achievement ac- 
cording to a meta-analysis by Wang, Haertel, and Walberg (1993). 
The results of our study suggest that if there is a good classroom 
climate, students are better able to develop their CPS skills. Students
may be more willing to approach new problem situations, to
actively explore them, and to come up with nonroutine solutions if 
classroom climate is positive. 

Classroom climate is considered to evolve in a complex social 
interaction that involves the teacher and the students (e.g., Jen- 
nings & Greenberg, 2009), but usually the teacher is able to 
actively shape the classroom climate to a significant degree 
through specific behaviors. For instance, Allred (2008) suggested 
seven strategies for building positive classrooms such as creating 
a classroom code of conduct that ensures a common understanding 
of positive and negative behaviors, reinforcing positive behaviors, 
or emphasizing the idea that learning is relevant for students’ own 
success. Empirically, Miller and Pedro (2006) showed that teach- 
ers’ actions have an impact on classroom climate, which subse- 
quently impacts students’ achievement (Adelman & Taylor, 2005). 
This impact that teachers have on classroom climate, and in turn, 
the impact of classroom climate on CPS skills suggest that CPS 
skills can be influenced—at least to a certain extent—and are thus 
under the control of external variables that can be changed, hence 
justifying the conception of CPS as a skill. 

To enhance CPS skills, one well-known but not commonly 
adopted method is for teachers to enable students to actively 
participate in and tackle new problems by themselves or in groups 
as opposed to teachers’ mere application of ex-cathedra teaching. 
The importance of active participation had already been proposed 
in the late 1980s by Chickering and Gamson (1987) who intro- 
duced seven principles for good practice in undergraduate educa- 
tion including the use of active learning techniques. More recently, 
Ambrose, Bridges, DiPietro, Lovett, and Norman (2010) empha- 
sized the idea that helping students to become self-directed learn- 
ers constitutes one core aspect of smart teaching. In the same vein, 
Chi (2009) stated that interactive and active involvement such as 
asking and answering questions when solving problems leads to 
deep understanding, whereas passive behavior such as mere lis- 
tening creates only a superficial understanding of problems. This 
applies even more to CPS skills than to factual domain knowledge, 
as CPS skills by definition include the acquisition and application 
of new knowledge while interacting with new problem environ- 
ments. However, the exact mechanisms and processes underlying 
the relations between teachers’ classroom management, classroom 
climate, and students’ CPS skills have yet to be identified. 

In summary, our results show that both cognitive abilities and 
the classroom climate as an indicator of the learning environment 
are related to CPS skills. We expect that the same relations will 
hold across the life span. In particular, environments that are 
supportive and flexible are likely to contribute to the development 
of CPS skills. 


Outcomes Associated With CPS Skills 


Results on the relations between CPS and the outcomes showed 
that CPS was highly correlated with school grades and academic 
potential and, in a latent multilevel regression model, explained 
additional variance on the within level in school grades (ΔR² =
.01) but not in academic potential. In addition, classroom climate
showed only an indirect relation to school grades (but not aca- 
demic potential) on the between level via its influence on CPS. 

On the individual level, CPS showed significant added value in 
explaining school grades, but the amount of additional variance 
that it explained was rather small. In previous studies, the incremental
validity of CPS ranged from explaining an additional 5% to
11% of the variance in school grades beyond working memory 
(Schweizer et al., 2013) and 1% to 11% of the variance in school 
grades beyond fluid reasoning (Greiff, Kretzschmar et al., 2014; 
Wüstenberg et al., 2012). It is conceivable that different study
designs as well as sample characteristics may account for these 
differences. In contrast to the earlier studies that we mentioned, 
our study (a) applied measures of two cognitive abilities imple- 
mented in the CHC theory instead of one, (b) included classroom 
climate as an additional predictor on the class level, and (c) 
employed a considerably larger sample that consisted of a more 
heterogeneous set of students than just students with high cognitive
abilities such as in Schweizer et al. (2013) and Wüstenberg et
al. (2012).

From a practical perspective, the relevance of the empirically 
determined incremental validity of CPS is ambiguous. On the one 
hand, it is quite possible that the significant influence of CPS 
would be rendered negligible if the operationalization of cognitive 
abilities were broadened further by including additional stratum II 
abilities. On the other hand, however, we used two of the most 
influential cognitive abilities located on stratum II of the CHC 
theory, abilities that are foundational to CPS (see Schweizer et al., 
2013) and that are also strongly associated with academic out- 
comes. The latter was supported by the high correlations that fluid 
reasoning and working memory had with the outcomes (i.e., man- 
ifest: .33 to .46). It would be reasonable to believe that other 
cognitive skills might also not be able to offer further incremental 
value if large amounts of variance were found to be associated 
with the strongest predictor. For instance, in Evans, Floyd, 
McGrew, and Leforgee’s (2002) study, the influences of the stra- 
tum II abilities auditory processing, long-term storage and re- 
trieval, and cognitive processing speed on the reading comprehen- 
sion skills of 10- to 14-year-old students were nonsignificant when 
considered simultaneously with the more influential abilities crys- 
tallized intelligence and working memory. However, a result such 
as this one does not call into question the construct itself (i.e., 
long-term storage and retrieval in Evans et al., 2002; CPS in this 
paper). Nevertheless, the incremental value was smaller than ex- 
pected. 

By contrast, results on the between level better matched our 
expectations. The significant indirect effect from classroom cli- 
mate to school grades via class differences in CPS combined with 
the nonsignificant direct effect of classroom climate on school 
grades illustrated the importance of CPS. It seems that a beneficial 
environment gave students the opportunity to develop their CPS 
skills, which, in turn, affected their school grades. Indeed, a good 
classroom climate (e.g., good social atmosphere for individual and 
group work, cf. Method section) gives students the opportunity to 
gather new knowledge in a self-regulated way, which is at the heart 
of CPS. However, as we did not use a longitudinal design, inter- 
pretations remain preliminary, and more research is needed to 
investigate causal relations between classroom climate, CPS, and 
the outcomes. 


Limitations 


Some limitations in our study need further consideration. First, 
the data were collected on only students attending the sixth grade 
in Finnish schools, and the study had a cross-sectional design,
which means that causality could not be established. From a 
theoretical perspective, it seems reasonable to expect that general 
cognitive dispositions such as working memory and fluid reason- 
ing are predictors of CPS skills. However, a solid test of this 
hypothesis calls for a longitudinal study in which students’ skills 
and abilities are measured across several years. This would allow 
researchers to evaluate whether students’ reasoning skills at an 
initial measurement occasion can predict CPS performance in 
subsequent test sessions or whether initial CPS predicts later 
reasoning. 

Second, results on working memory were based on a randomly 
chosen subsample of 20% of the students (N = 357), yielding a
low covariance coverage of the data. However, Little’s test re- 
vealed that the data were missing completely at random, implying 
that there was no relation between the pattern of missing data and 
any values that were observed or missing. Further, we applied the 
WLSMV estimator, which was designed especially for computa- 
tions in small and moderate sample sizes and provides robust 
parameter estimates for constructs with dichotomous variables 
(e.g., working memory in our study; Beauducel & Herzberg, 
2006). Beauducel and Herzberg (2006) and Flora and Curran 
(2004) showed that applying the WLSMV estimator resulted in 
accurate parameter estimates and standard errors under both nor- 
mal and nonnormal latent response distributions in simulation 
studies using sample sizes of N = 250 and N = 200, respectively. 

Third, our working memory measure assessed the working 
memory function simultaneous storage and manipulation in the 
numerical content domain. Hence, our measure did not capture the 
verbal or visuospatial components of working memory. This may 
have altered the relation between working memory and CPS. In 
fact, our results differed slightly from Schweizer et al.’s (2013), 
who applied the same approach for measuring CPS and reported 
latent correlations for a visuospatial working memory test with 
CPS knowledge acquisition (r = .46) and CPS knowledge appli- 
cation (r = .41). In our study, when we used a numerical working 
memory test in a full measurement model (not reported), we found 
a similar latent correlation between working memory and CPS 
knowledge acquisition (r = .46), but the correlation was higher 
between working memory and CPS knowledge application (r = 
.63). Thus, the relations between working memory and CPS 
knowledge application might have been overestimated in our study 
compared with Schweizer et al.’s (2013) results. However, we 
used a second-order CPS factor in our analyses, implying that the 
potential effect of overestimating the effect of working memory on 
CPS was reduced because the second-order factor captured only 
the variance that was shared between CPS knowledge acquisition 
and CPS knowledge application. 


Implications and Outlook 


Cognitive performance indicators can be differentiated into 
skills and abilities. However, no ability is fully predetermined (i.e., 
the notion of ability; Barnett, 2004), and no skill is shaped only by 
learning experiences (i.e., the notion of skill; Fischer & Bidell, 
1998). Depending on the location of a cognitive construct on the 
continuum between predetermined and learnable, it might be con- 
sidered either an ability or a skill. Recent conceptualizations in 
educational large-scale assessments have claimed that CPS is 
composed of a set of learned and teachable skills (e.g., in the PISA
survey; OECD, 2012). Indeed, from a theoretical perspective, 
learning is one of the key aspects of CPS because acquiring new 
knowledge in intransparent and complex situations and subse- 
quently using this knowledge to influence these situations is cen- 
tral to CPS skills. However, there has been no empirical support 
for this idea. To this end, this study is the first to paint a more solid 
picture of the understanding of CPS as a set of skills. The results 
indicate that CPS skills rely to a substantial degree on a person’s 
level of cognitive ability. At the same time, CPS also evolves 
through external aspects of the classroom climate, thus rendering 
it a learnable skill. 

In our opinion, the search for antecedents of CPS will keep 
researchers busy in the upcoming years. The PISA 2012 problem 
solving report (OECD, 2014) was published last year and has 
drawn the attention of several groups of stakeholders. It is quite 
conceivable that educationalists and politicians in nations with 
similar performance in math and science (e.g., Canada and Poland) 
but different performance in complex problem solving (i.e., Can- 
ada considerably outperformed Poland) may already wonder about 
the reasons for such results and how to provide sufficient CPS 
learning opportunities to prepare a student population for future 
challenges in life. Whereas the good news is that external factors 
have a substantial impact on CPS skills, the exact nature of these 
factors and the mechanisms underlying this influence have yet to 
be discovered. With the results of this study as a starting point, we 
encourage future research to reinforce the relevance of teacher 
behavior in creating a classroom climate that is conducive for 
learning (Wubbels & Brekelmans, 1998). However, given the 
amount of variance left unaccounted for in CPS skills, other factors 
besides classroom climate may have an impact on CPS skills. For 
instance, not only classroom climate at school but also more 
informal learning opportunities in the parental environment may 
be relevant to the development of CPS skills (see Greiff et al., 
2013). With respect to educational policy, it may be interesting for 
future research to reveal whether different educational systems 
produce teachers with different teaching styles (e.g., teacher- 
guided vs. discovery-oriented; Hsu, 2008) and whether such dif- 
ferences, in turn, affect students’ CPS skills. Some ideas about 
how to teach domain-general skills such as CPS were introduced 
by Greiff, Wüstenberg et al. (2014) in a theoretical position paper. 
The authors suggested, among other ideas, that students should be 
encouraged to focus on relevant information, to adequately repre- 
sent their acquired knowledge and to associate it with their extant 
knowledge, to choose appropriate actions and operators when 
trying to reach a goal, and to evaluate mental models and their 
validity in reference to evidence. 

With regard to the relation of CPS to important performance 
outcomes, research may benefit from investigating CPS in a nat- 
uralistic setting and from examining relations between CPS skills 
and outcomes other than pure academic performance such as 
school grades or academic potential. School grades may not be the 
ideal criterion for investigating the incremental value of CPS. 
Further research may wish to focus on real-world outcomes that 
are more closely associated with the core characteristics of CPS, 
that is, to gather and apply knowledge in unknown problem situ- 
ations. For instance, Danner, Hagemann, Schankin, Hager, and 
Funke (2011) examined the impact of CPS skills on occupational 
outcomes (i.e., supervisory ratings of overall job performance) and 
reported that CPS skills offered incremental explanatory power 
beyond fluid reasoning. Kersting (2001) found a correlation of .37 
between CPS and police officers’ job performance as rated by their 
supervisors. 

Indeed, combining research on CPS assessed with computer- 
based tasks in a laboratory with naturalistic decision-making in 
various settings (Osman, 2010) may provide valuable insights into 
the validity of computer-based CPS measurements. Further, al- 
though more difficult to acquire, real-world outcomes such as the 
problem solving performance of individuals who frequently have 
to solve problems may be better suited for investigating the incre- 
mental validity of CPS beyond other cognitive abilities. For in- 
stance, in the tradition of Danner et al. (2011), supervisory ratings 
of the overall job performance of experienced project leaders who 
face complex problems in their daily work life may be compared 
with the performance of individuals who are not well-acquainted 
with problem solving. 

In summary, this study extends initial starting points for a 
deeper understanding of CPS skills, their antecedents, and outcomes. 
Clearly, research on CPS may benefit from investigating CPS 
performance in a naturalistic setting. With regard to the anteced- 
ents of CPS, researchers still do not know how to best prepare 
students for later success in life and how to meet the greatest 
challenge education currently faces—to make students good prob- 
lem solvers (Mayer & Wittrock, 2006), but research efforts are 
moving in the right direction. The answer found at the end may be, 
after all, an old one newly discovered: Over 40 years ago, the 
eminent Hungarian mathematician George Pólya (1971) had already 
described a good education as systematically giving students 
the opportunity to discover solutions to new problems them- 
selves—and researchers may find that doing this at school, at the 
work place, and even on the policy level is the answer needed to 
meet one of the greatest challenges of education in the 21st 
century. 
Appendix 


Linear Structural Equations Underlying All Nine MicroDYN Tasks 


Item    System size     Effects

1       2 × 1-System    Only direct
2       2 × 2-System    Only direct
3       2 × 2-System    Only direct
4       3 × 2-System    Only direct
5       3 × 3-System    Only direct
6       3 × 3-System    Direct and indirect
7       3 × 3-System    Only direct
8       3 × 2-System    Direct and indirect
9       3 × 3-System    Direct and indirect

[Linear structural equations column: coefficients illegible in the source.]

Note. $X_t$, $Y_t$, and $Z_t$ denote the values of the output variables, and $A_t$, $B_t$, and $C_t$ denote the values of the input variables in the present trial, whereas 
$X_{t+1}$, $Y_{t+1}$, and $Z_{t+1}$ denote the values of the output variables in the subsequent trial. 
We introduce the concept of differential prediction generalization in the context of college admissions 
testing. Specifically, we assess the extent to which predicted first-year college grade point average (GPA) 
based on high-school grade point average (HSGPA) and SAT scores depends on a student’s ethnicity and 
gender and whether this difference varies across samples. We compared 257,336 female and 220,433 
male students across 339 samples, 29,734 Black and 304,372 White students across 247 samples, and 
35,681 Hispanic and 308,818 White students across 264 samples collected from 176 colleges and 
universities between the years 2006 and 2008. Overall, results show a lack of differential prediction 
generalization because variability remains after accounting for methodological and statistical artifacts 
including sample size, range restriction, proportion of students across ethnicity- and gender-based 
subgroups, subgroup mean differences on the predictors (i.e, HSGPA, SAT-Critical Reading, SAT- 
Math, and SAT-Writing), and SDs for the predictors. We offer an agenda for future research aimed at 
understanding several contextual reasons for a lack of differential prediction generalization based on 
ethnicity and gender. Results from such research will likely lead to a better understanding of the reasons 
for differential prediction and interventions aimed at reducing or eliminating it when it exists. 


Keywords: differential prediction, admissions testing, test fairness, test bias 
As noted in the Standards for Educational and Psychological 
Testing (American Educational Research Association, American 
Psychological Association, and National Council on Measurement 
in Education, 2014), “the term predictive bias may be used when 
evidence is found that differences exist in the patterns of associ- 
ations between test scores and other variables for different groups 
. . . one approach examines slope and intercept differences between 
two targeted groups . . . while another examines systematic devi- 
ations from a common regression line for any number of groups of 
interest” (pp. 51-52). Similarly, the Principles for the Validation 
and Use of Personnel Selection Procedures (Society for Industrial 
and Organizational Psychology, 2003) state that “slope and/or 
intercept differences between subgroups indicate predictive bias” 
(p. 32). 
The aforementioned widely adopted and standard definition of 
predictive bias, which is also labeled differential prediction, refers 
to a difference in the prediction of scores across subgroups and 
does not stipulate which group’s scores are under- or overpre- 
dicted. In other words, differential prediction also exists when the 
prediction of criteria is different across groups such that the 
minority group “benefits” from overprediction. In fact, although 
not within the context of educational testing, lawsuits regarding 
reverse discrimination in preemployment testing such as the Ricci 
v. DeStefano et al. (2009) U.S. Supreme Court case are based on 
this logic because majority and minority applicants are protected 
under Title VII of the Civil Rights Act of 1964. 

Aguinis, Culpepper, and Pierce (2010) revived the fairly dor- 
mant research domain of differential prediction and received sub- 
stantial media attention, including coverage by USA Today, The 
Economist, HR Magazine, and many other outlets. Thus, research 
on this topic is important for educational psychology and other 
fields concerned with high-stakes testing, such as human resource 
management and industrial and organizational psychology, as well 
as society at large. Aguinis, Culpepper, et al. (2010) stated that 
there is an “important opportunity for . . . researchers to revive the 
topic of differential prediction and make contributions with mea- 
surable and important implications for organizations and society” 
(p. 675). 

Following Aguinis, Culpepper, et al.’s (2010) call, several re- 
searchers have echoed the need for additional work regarding 
differential prediction in educational and preemployment contexts 
(Berry, Clark, & McClure, 2011; Berry, Sackett, & Sund, 2013; 
Fischer, Schult, & Hell, 2013). Our study relies on a data-analytic 
approach similar to that used in investigations of validity gener- 
alization (i.e., the extent to which validity coefficients vary across 
studies) to introduce a new concept we label differential prediction 
generalization, which refers to the extent to which differential 
prediction varies across studies. Next, we offer a literature review 
and description of our study’s rationale, goals, and contributions in 
relation to previous research. 


Literature Review and Present Study 


The potential existence of differential prediction by gender and 
ethnicity has been investigated for several decades. For example, 
Cleary (1966) investigated data from three colleges, Pfeifer and 
Sedlacek (1971) analyzed data from 13 institutions, and Temp 
(1971) investigated 13 institutions. More recently, Mattern and 
Patterson (2013) examined differential prediction of the SAT by 
relying on a larger database. In the majority of these studies, 
differential prediction has been found, on average, to be small such 
that tests overpredict grades for Black and Hispanic students (e.g., 
Mattern & Patterson, 2013) and underpredict grades for female 
students (e.g., Ancis & Sedlacek, 1997). The majority of this body 
of work has focused on understanding the degree of differential 
prediction in specific institutions or the average degree of differ- 
ential prediction across institutions. 

A related but different line of research has addressed the extent 
to which validity coefficients (e.g., correlation coefficient between 
test scores and a criterion such as college grades) generalize (i.e., 
are similar) across contexts. This line of inquiry was motivated by 
research conducted in the 1960s (e.g., Ghiselli, 1966; Guion, 1965) 
suggesting that validity coefficients change from context to con- 
text and, therefore, are situation-specific. In a seminal article 
challenging this situational specificity hypothesis, Schmidt and 
Hunter (1977) offered an analytic approach called validity gener- 
alization or psychometric meta-analysis, which involves first as- 
sessing the degree of variability of validity coefficients across 
studies and then calculating the extent to which such variability 
may be substantive (supporting situational specificity) or, instead, 
because of methodological and statistical artifacts (supporting va- 
lidity generalization, Hunter & Schmidt, 2004). This two-step 
process is necessary because the observed variability of coeffi- 
cients across contexts may be because of factors such as sampling 
error, measurement error, and range restriction (Aguinis & Pierce, 
1998; Aguinis, Sturman, & Pierce, 2008).¹ In other words, these 
methodological and statistical artifacts can give the impression that 
there is a great deal of variability in correlation (i.e., validity) 
coefficients across studies, whereas in actuality this variability 
may be because of differences in sample size, measurement error, 
and range restriction. 
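
The two-step logic can be illustrated with a minimal sketch of a "bare-bones" check in the spirit of Hunter and Schmidt (2004), which considers sampling error only. The validity coefficients and sample sizes below are hypothetical values chosen for illustration, and the function name `bare_bones` is an assumption rather than anything taken from the cited sources.

```python
def bare_bones(rs, ns):
    """Return (weighted mean r, observed variance, expected sampling-error variance)."""
    total_n = sum(ns)
    # Step 1: sample-size-weighted mean and observed variance of the coefficients.
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Step 2: variance expected from sampling error alone, using the average sample size.
    n_bar = total_n / len(ns)
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
    return r_bar, var_obs, var_err


# Hypothetical validity coefficients from five studies (not data from the article).
rs = [0.30, 0.42, 0.25, 0.38, 0.33]
ns = [120, 80, 200, 150, 100]

r_bar, var_obs, var_err = bare_bones(rs, ns)
# Share of observed variance attributable to sampling error, capped at 100%.
share = min(1.0, var_err / var_obs)
print(f"mean r = {r_bar:.3f}, observed var = {var_obs:.5f}, "
      f"sampling-error var = {var_err:.5f}, share = {share:.0%}")
```

In this made-up example the expected sampling-error variance exceeds the observed variance, so the apparent spread of coefficients across studies would not, by itself, support situational specificity.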

Since the introduction of validity generalization procedures by 
Schmidt and Hunter (1977), several studies have been conducted 
examining correlations in the context of educational and preem- 
ployment testing. For example, Linn, Harnisch, and Dunbar (1981) 
conducted a validity generalization study of the LSAT and its 
relation with first-year grades and reported that the majority of the 
variance in observed validity coefficients was explained by meth- 
odological and statistical artifacts. Similarly, in two separate stud- 
ies, Boldt (1986a, 1986b) conducted validity generalization anal- 
yses to understand whether the validity of the SAT and GRE 
generalizes across colleges and universities and the overall conclusion 
was that the correlation between these test scores and 
subsequent grades seems to generalize. 

Considering our current knowledge about differential prediction 
and the separate but related body of work on validity generaliza- 
tion points to a knowledge gap regarding the extent of differential 
prediction generalization. This knowledge gap is important be- 
cause, as noted by Linn (1978), “differences in prediction systems 
have a more direct bearing on issues of bias in selection than do 
differences in correlations” (p. 511). Specifically, validity gener- 
alization refers to whether the correlation between test scores and 
criteria is similar across contexts. In contrast, we conceptualize 
differential prediction generalization as the extent to which differ- 
ential prediction (i.e., differences in regression coefficients across 
groups) is similar across contexts. Thus, differential prediction 
generalization is different from validity generalization and highly 
informative because, as noted by the Standards for Educational 
and Psychological Testing, “correlation coefficients provide inad- 
equate evidence for or against a differential prediction hypothesis 
if groups or treatments are found to have unequal means and 
variances on the test and the criterion. It is particularly important 
in the context of testing for high-stakes purposes that test devel- 
opers and/or users examine differential prediction and avoid the 
use of correlation coefficients in situations where groups or treat- 
ments result in unequal means or variances on the test and crite- 
rion” (American Educational Research Association, American 
Psychological Association, and National Council on Measurement 
in Education, 2014, p. 66). 

From a theoretical perspective, our interest in differential 
prediction generalization is motivated by several possible 
sociohistorical-cultural and social psychological explanations for 
why the use of test scores in educational and employment settings 
to predict performance can differ based on a test taker’s ethnicity 
or gender and why differential prediction is unlikely to be similar 
(i.e., generalize) across contexts (Aguinis, Culpepper, et al., 2010; 
Berry et al., 2011; Culpepper & Davenport, 2009; Kobrin & 
Patterson, 2011; Passler, Beinicke, & Hell, 2014). For example, 
these potential explanations include (a) stereotype threat (Brown & 
Day, 2006; Sackett, Hardison, & Cullen, 2004; Steele & Aronson, 
1995; Walton, Murphy, & Ryan, 2015; Walton & Spencer, 2009); 
(b) lack of a common cultural frame of reference and identity 
across groups (Gould, 1999; Ogbu, 1993); (c) lack of a common 
framework for understanding and interpreting tests and the testing 
context (Grubb & Ollendick, 1986); (d) leniency effects favoring 
one group over another (Berry et al., 2013); (e) differential recruit- 
ing, mentoring, and retention interventions across groups (Berry et 
al., 2013); and (f) differential course difficulty across groups 
(Berry & Sackett, 2009). Given these factors, it seems unlikely that 
differential prediction would generalize across contexts and insti- 
tutions. However, the possible presence of heterogeneity is an 
issue that has not been assessed systematically. For example, 
although Linn (1973) described differences in the extent of differential 
prediction across the 22 institutions included in his study, it 
is unclear the extent to which such variability was substantive in 
nature or because of methodological and statistical artifacts. 

¹ In addition to sampling error, measurement error, and range restriction, 
Hunter and Schmidt (2004) and others (Aguinis, Pierce, & Culpepper, 
2009) have identified additional factors that increase the variance of 
validity coefficients across studies. These factors include scale coarseness, 
imperfect construct validity in the predictor and/or criterion variables, 
computational and other errors in data, and artificial dichotomization of 
continuous variables. 

In sum, our study introduces the new concept of differential 
prediction generalization and investigates the potential presence of 
variability in ethnicity and gender-based differential prediction 
across contexts. We do so using data predicting first-year college 
grade point average (GPA) from SAT scores and high-school 
GPA. 


Method 


Data Collection Procedures and Participants 


We obtained the raw data from Mattern and Patterson’s (2013) 
Appendixes A-F, which include tables in a 384-page PDF docu- 
ment available at http://dx.doi.org/10.1037/a0030610.supp. We 
exported the data from these tables to Microsoft Excel using 
Able2Extract Pro 7.0 and SomePDF 1.0. Additional details regard- 
ing the data extraction algorithms and procedures are available 
from the authors upon request. 

The tables include variance-covariance matrices involving rela- 
tions among SAT scores, first-year college GPA, high-school 
grade point average (HSGPA), and demographic variables (i.e., 
ethnicity and sex) for 176 colleges and universities (i.e., 348 
unique cohorts). Specifically, these include participating colleges 
and universities that provided the College Board with GPA and 
these data were matched to College Board databases that include 
SAT scores and responses to the SAT questionnaire, which in- 
cluded self-reported HSGPA and demographic information. The 
data were collected by the College Board as part of a multiyear 
study between 2006 and 2008. Identical to Mattern and Patterson 
(2013), we treated each cohort (henceforth referred to as a “sample”) 
as an individual data point. Sixty-one out of 339 (i.e., 
17.99%), 48 out of 247 (i.e., 19.43%), and 50 out of 264 (i.e., 
18.93%) institutions provided three samples for the female-male 
(FM), Black-White (BW), and Hispanic-White (HW) compari- 
sons, respectively. Thus, the contribution of three samples by 
institutions is only a small portion of the total, which reduces the 
likelihood that dependency due to cohorts nested within institu- 
tions may have biased our results. To more formally assess the 
possibility of dependence in the data structure, we examined 
the variance attributed to cohorts nested within institutions and the 
result was only .4% of the total variability. In other words, this 
small amount of variance suggests that it is appropriate to treat 
each sample as an individual data point in our analyses because 
data dependence did not bias standard error estimates (Aguinis & 
Culpepper, 2015; Aguinis, Gottfredson, & Culpepper, 2013; 
Raudenbush & Bryk, 2002). 

Mattern and Patterson (2013) reported that the institutions were 
diverse in terms of geographic region, public/private, size, and 
selectivity. In addition, Mattern and Patterson (2013) reported 
removing samples with fewer than 15 individuals in any of the 
ethnicity- or gender-based subgroups from their analyses. Accord- 
ingly, FM comparisons were made based on approximately 
257,336 women and 220,433 men across 339 samples. BW com- 
parisons were based on 29,734 Black and 304,372 White students 
across 247 samples. For the HW comparisons, analyses were based 
on 35,681 Hispanic and 308,818 White students across 264 samples. 


Differential Prediction Analysis 


Assessing the presence of differential prediction involves esti- 
mating the following three models (American Educational Re- 
search Association, American Psychological Association, and Na- 
tional Council on Measurement in Education, 2014; Cleary, 1968; 
Society for Industrial and Organizational Psychology, 2003): 


$$\mathrm{GPA} = \beta_0 + \beta_1\,\mathrm{HSGPA} + \beta_2\,\text{SAT-CR} + \beta_3\,\text{SAT-M} + \beta_4\,\text{SAT-W} + e \tag{1}$$


$$\mathrm{GPA} = \beta_0 + \beta_1\,\mathrm{HSGPA} + \beta_2\,\text{SAT-CR} + \beta_3\,\text{SAT-M} + \beta_4\,\text{SAT-W} + \beta_5 G + e \tag{2}$$


$$\begin{aligned}
\mathrm{GPA} = {} & \beta_0 + \beta_1\,\mathrm{HSGPA} + \beta_2\,\text{SAT-CR} + \beta_3\,\text{SAT-M} + \beta_4\,\text{SAT-W} + \beta_5 G \\
& + \beta_6\,\mathrm{HSGPA} \cdot G + \beta_7\,\text{SAT-CR} \cdot G + \beta_8\,\text{SAT-M} \cdot G + \beta_9\,\text{SAT-W} \cdot G + e
\end{aligned} \tag{3}$$


Equation 1 includes the criterion GPA regressed on the predictors 
HSGPA, SAT-CR (SAT-Critical Reading), SAT-M (SAT-Math), 
and SAT-W (SAT-Writing). The model in Equation 2 
differs from Equation 1 in that it includes a dummy variable G, 
which has two categories and is used to assess the FM, BW, or HW 
comparisons. The model in Equation 3 includes product terms that 
capture interaction effects on GPA (i.e., moderating effect of 
ethnicity and gender on the relation between the predictors and 
GPA) and can be written in matrix notation as follows: 


$$y_j = X_j \beta_j + e_j \tag{4}$$


where, for sample $j$, $y_j$ is an $n_j$-dimensional vector of criterion scores 
(i.e., $n_j$ is the size for sample $j$), $X_j$ is an $n_j \times q$ matrix of predictor 
variables (i.e., $q = 9$ for Equation 3), $\beta_j$ is a $q$-dimensional vector 
of regression coefficients, and $e_j$ is an $n_j$-dimensional vector of 
errors. The goal of differential prediction analysis is to examine 
whether test scores differentially predict criteria for different 
groups by examining whether coefficients within $\beta_j$ (i.e., $\beta_5$, $\beta_6$, 
$\beta_7$, $\beta_8$, and $\beta_9$ in Equation 3) are different from zero. Specifically, 
a nonzero regression coefficient associated with predictor G suggests 
the presence of intercept-based differential prediction and 
nonzero coefficients associated with the product terms suggest the 
presence of slope-based differential prediction. 
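
The three nested models can be sketched in code as follows. This is only an illustration under stated assumptions: the data are synthetic (standardized predictors, a noiseless criterion, and invented coefficients in `true_beta`), and `solve` and `ols` are hypothetical helper names written here with the Python standard library; none of this reproduces the authors' data or software.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ols(X, y):
    """Least squares estimates from the normal equations (X'X) b = X'y."""
    q = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(q)] for a in range(q)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(q)]
    return solve(XtX, Xty)


random.seed(0)
# Invented coefficients: intercept, HSGPA, SAT-CR, SAT-M, SAT-W, G, then four product terms.
true_beta = [3.0, 0.40, 0.10, 0.08, 0.12, -0.15, 0.05, 0.0, 0.0, -0.03]
rows, gpa = [], []
for i in range(300):
    preds = [random.gauss(0, 1) for _ in range(4)]  # standardized synthetic predictors
    g = float(i % 2)                                # dummy variable G (two groups)
    full = [1.0] + preds + [g] + [p * g for p in preds]
    rows.append(full)
    gpa.append(sum(b * x for b, x in zip(true_beta, full)))  # noiseless criterion

b1 = ols([r[:5] for r in rows], gpa)  # Equation 1: predictors only
b2 = ols([r[:6] for r in rows], gpa)  # Equation 2: adds G (intercept difference)
b3 = ols(rows, gpa)                   # Equation 3: adds product terms (slope differences)

print("intercept difference (G):", round(b3[5], 4))
print("slope differences:", [round(v, 4) for v in b3[6:]])
```

Because the synthetic criterion contains no error term, the Equation 3 fit recovers the invented coefficients; with real data, the coefficient on G and the product-term coefficients would be tested against zero.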


Differential Prediction Generalization Analysis 


We used multivariate meta-analytic regression modeling 
(MMA) to synthesize regression coefficients and assess the degree 
of variability in differential prediction across samples as described 
by Becker and Wu (2007) and Chen, Manning, and Dupuis (2012). 
The MMA procedure uses data from each sample (i.e., $b_j$ and 
$\mathrm{Cov}[b_j \mid X_j]$) to estimate a meta-analyzed mean, in addition to 
cross-sample variance components. Specifically, the random effects 
MMA model described by Chen et al. (2012) includes the 
following equation for $b_j$: 
where $W = [I_q, W_j]$ is a $q \times (q + p)$ block design matrix that 
includes a $q$-dimensional identity matrix and a $q \times p$ matrix of 
sample-level variables to explain differences in $b_j$. Furthermore, $\delta_j$ 
is a vector of random effects for sample $j$ defined as $\delta_j \sim N_q(0_q, T)$, 
where $T$ is a $q \times q$ between-sample variance-covariance matrix 
that quantifies the amount of heterogeneity that exists across 
samples above and beyond sampling error (i.e., $e_j$, which is an 
error with a multivariate normal distribution; Chen et al., 2012). 

Methodological and statistical artifacts. The goal of differ- 
ential prediction generalization analysis is to quantify the variabil- 
ity in differential prediction across samples. However, sampling 
error, range restriction, and measurement error are three factors 
that should be ruled out given that they usually account for the 
largest proportion of observed variance (Aguinis, 2001; Hunter & 
Schmidt, 2004). In fact, Schmidt and Hunter (1981) estimated that 
an average of 72% of the variance of validity coefficients observed 
across studies is the result of these artifacts and, moreover, sam- 
pling error alone accounts for 85% of the variance accounted for 
by artifacts. Accordingly, in our study, $W_j$ includes sample size 
(i.e., to account for sampling error). 

In addition to sampling error, range restriction can increase or 
decrease observed variability in relation to true variability (Mur- 
phy, 1993). Accordingly, as noted by Linn (1983), “it is essential 
that selection effects be considered if our correlational and regres- 
sion analysis results are to be properly interpretable” (p. 13). 
Range restriction is pervasive in college admissions testing be- 
cause the data examined include only those students who have 
been admitted and for whom GPA information is subsequently 
available. The standard corrections for range restriction require 
three assumptions: linearity between predictors and criterion, con- 
stant residual error variance, and criterion scores missing at ran- 
dom (MAR) (Mendoza, 1993; Mendoza, Bard, Mumford, & Ang, 
2004). Under these assumptions, commonly employed corrections 
such as Lawley’s multivariate correction (Birnbaum, Paulson, & 
Andrews, 1950; Lawley, 1944) yield unbiased estimates of popu- 
lation correlation coefficients.* Furthermore, simulation studies 
support the accuracy of the Lawley correction across different 
sample sizes, magnitude of predictor intercorrelations, and degree 
of selectivity (Muthén & Hsu, 1993; Sackett & Yang, 2000). 

A relevant issue pertaining to our study is that if the MAR and 
linearity assumptions are satisfied, the restricted regression coefficients 
(i.e., estimates in the selected sample) equal the estimated 
unrestricted coefficients. Stated differently, if these assumptions 
are met, range restriction does not bias estimates of $\beta_j$, and the 
least squares estimator for the restricted sample is identical to the 
estimator corrected for range restriction. For example, consider 
Lawley’s procedure and let $S_{xxj}$ denote a $q \times q$ variance-covariance 
matrix among the predictors (i.e., covariances among 
the predictors in Equation 3) and $s_{xyj}$ be a $q$-dimensional vector of 
covariances between the predictors in Equation 3 and GPA in the 
$j$th sample. If there is no range restriction, the $q$-dimensional vector 
of coefficients for sample $j$ in Equation 3 are estimated as 
$b_j = S_{xxj}^{-1} s_{xyj}$. However, $S_{xxj}$ and $s_{xyj}$ differ from values in the 
unrestricted applicant pool and, similar to Mattern and Patterson 
(2013), researchers employ Lawley’s correction, which uses sample 
$j$’s $q \times q$ predictor variance-covariance matrix $\Sigma_{xxj}$ from the 
applicant pool. This information is available because Mattern and 
Patterson reported $S_{xxj}$ and also $\Sigma_{xxj}$ for all students in the applicant 
pool. The $q$-dimensional vector of range restriction corrected 
coefficients are defined as 


$$\tilde{b}_j = \Sigma_{xxj}^{-1} \hat{\sigma}_{xyj} \tag{6}$$


where the Lawley correction defines $\hat{\sigma}_{xyj} = \Sigma_{xxj} S_{xxj}^{-1} s_{xyj}$. As expected, 
the restricted coefficients equal the unrestricted coefficients. 
Specifically, $\tilde{b}_j = \Sigma_{xxj}^{-1} \hat{\sigma}_{xyj} = \Sigma_{xxj}^{-1} \Sigma_{xxj} S_{xxj}^{-1} s_{xyj} = b_j$, so that 
$\tilde{b}_j = b_j$, if the MAR and linearity assumptions are satisfied. 
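
The algebra in this paragraph can be checked numerically. The sketch below assumes two predictors and uses invented covariance values; `inv2` and `matvec` are hypothetical helpers defined here, not functions from any package used by the authors.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# Invented values for one sample (not taken from the article).
S_xx = [[0.6, 0.2], [0.2, 0.5]]      # predictor covariances among enrolled students
s_xy = [0.30, 0.20]                  # predictor-GPA covariances among enrolled students
Sigma_xx = [[1.0, 0.5], [0.5, 1.0]]  # predictor covariances in the applicant pool

# Restricted coefficients: b = S_xx^{-1} s_xy
b_restricted = matvec(inv2(S_xx), s_xy)

# Lawley-corrected covariances: sigma_xy = Sigma_xx S_xx^{-1} s_xy
sigma_xy_hat = matvec(Sigma_xx, b_restricted)

# Corrected coefficients: Sigma_xx^{-1} sigma_xy, which collapses back to b
b_corrected = matvec(inv2(Sigma_xx), sigma_xy_hat)

print(b_restricted)
print(b_corrected)
```

The two printed vectors agree up to floating-point error, which is the cancellation $\Sigma_{xxj}^{-1}\Sigma_{xxj} = I$ at work.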

The prior discussion shows that the restricted regression coefficients 
equal the corrected coefficients when the MAR and linearity 
assumptions are satisfied. In contrast, the restricted standard 
errors are too small, which implies that inferences for regression 
coefficients $\beta_j$ are incorrect (Aguinis & Stone-Romero, 1997; 
Culpepper, 2012b). Consequently, it is necessary to correct the 
sample standard deviation of GPA for range restriction to obtain a 
corrected covariance matrix of $b_j$. Let $s_j^2$ be sample $j$’s variance of 
college grades. Lawley’s corrected variance $\hat{\sigma}_j^2$ is estimated as 


$$\hat{\sigma}_j^2 = s_j^2 - s_{xyj}^{T} S_{xxj}^{-1} \left( I_q - \Sigma_{xxj} S_{xxj}^{-1} \right) s_{xyj} \tag{7}$$


where $T$ indicates a vector transpose and $I_q$ is a $q$-dimensional 
identity matrix. If college grades were collected for all applicants, 
$\mathrm{Cov}(b_j \mid X_j) = \sigma_j^2 \Sigma_{xxj}^{-1} / N_j$ would be the variance-covariance matrix of 
$b_j$ in the applicant pool conditioned upon the predictor matrix $X_j$ 
with $\sigma_j^2$ as the criterion variance in the applicant pool and $N_j$ as the 
number of applicants. However, college grades are collected for 
admitted and enrolled students only, so $\hat{\sigma}_j^2$ must be used as an 
estimate of $\sigma_j^2$ and $n_j$ is used rather than $N_j$, which implies that an 
estimate for the range restriction corrected variance-covariance 
matrix of the $b_j$ for sample $j$ is 


$$\widehat{\mathrm{Cov}}(b_j \mid X_j) = \frac{\hat{\sigma}_j^2 \, \Sigma_{xxj}^{-1}}{n_j} \tag{8}$$
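
As a rough numerical illustration of Equation 7, the sketch below computes Lawley's corrected criterion variance, read here as $s_j^2 - s_{xyj}^{T} S_{xxj}^{-1}(I_q - \Sigma_{xxj} S_{xxj}^{-1}) s_{xyj}$, for two predictors. All numbers are invented, and `inv2`, `matvec`, and `dot` are hypothetical helpers. It also checks the algebraically equivalent form $s_j^2 + b_j^{T}(\Sigma_{xxj} - S_{xxj}) b_j$, which is a rearrangement added here and not a formula stated in the article.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Invented values for one sample (not taken from the article).
S_xx = [[0.6, 0.2], [0.2, 0.5]]      # restricted predictor covariances
s_xy = [0.30, 0.20]                  # restricted predictor-GPA covariances
Sigma_xx = [[1.0, 0.5], [0.5, 1.0]]  # applicant-pool predictor covariances
s2_y = 0.50                          # restricted variance of college grades

S_inv = inv2(S_xx)
b = matvec(S_inv, s_xy)

# (I_q - Sigma_xx S_xx^{-1}) s_xy equals s_xy - Sigma_xx b
inner = [s - t for s, t in zip(s_xy, matvec(Sigma_xx, b))]
sigma2_hat = s2_y - dot(s_xy, matvec(S_inv, inner))

# Equivalent rearrangement: s2_y + b' (Sigma_xx - S_xx) b
diff = [[Sigma_xx[r][c] - S_xx[r][c] for c in range(2)] for r in range(2)]
alt = s2_y + dot(b, matvec(diff, b))

print(sigma2_hat, alt)
```

With these made-up inputs the applicant pool is more variable than the enrolled sample on both predictors, so the corrected criterion variance comes out larger than the restricted one, consistent with the point that restricted standard errors are too small.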


In addition to sampling error and range restriction, measurement 
error in the criterion GPA also needs to be ruled out as a potential 
source of variability in differential prediction across samples. 
Criterion measurement error usually inflates observed variability 
of correlation coefficients across studies (Schmidt & Hunter, 
1977). This effect has been documented regarding correlation 
coefficients but Cohen, Cohen, West, and Aiken (2003, pp. 56-57) 
showed that bivariate regression coefficients are unaffected by 
criterion measurement error. Extending the work by Cohen et al. 
(2003), Supplemental File A available online provides new derivations 
and proof that correcting the criterion for measurement 
error using classical test theory does not affect the observed 
variance of differential prediction across samples in the multiple 
predictor case. Hence, correcting criterion measurement error in 
GPA would not change estimates of differential prediction variability. 

* Although Mendoza (1993) argued that the MAR assumption is reasonable 
in the particular context of college admissions testing because 
decision-makers do not observe the missing criterion scores, the effects of 
violating the MAR, linearity, and homoscedasticity assumptions on differential 
prediction generalization analysis are unknown and would depend on 
the nature of the missing data pattern, the nonlinear relationship (i.e., 
concave or convex), and the nonconstant error pattern (Culpepper, 2015). 
Mattern and Patterson (2013) did not report results regarding compliance 
with these assumptions and, in addition, their dataset did not include 
sufficient information for us to conduct this assessment. Specifically, 
complete student records would be needed to test for compliance with the 
linearity and homoscedasticity assumptions and additional information 
from admissions offices would be needed to assess compliance with the 
MAR assumption. Thus, additional data and research are needed to address 
these issues. 

Another methodological artifact that could affect the degree of 
observed differential prediction variability across samples is dif- 
ferential predictor measurement error. Mattern and Patterson 
(2013) reported reliability information for the predictors across all 
samples: .82, .91, .91, and .89 for HSGPA, SAT-Critical Reading 
(SAT-CR), SAT-Math (SAT-M), and SAT-Writing (SAT-W), re- 
spectively. Differential prediction variability may be due, at least 
in part, to differences in predictor reliability across institutions 
(i.e., the same population parameter may take on different sample- 
based values depending on the local degree of measurement error). 
However, it is not possible to correct for the potential effects of 
differential reliability on differential prediction variability without 
sample-level reliability information. Nevertheless, reliability esti- 
mates for all predictors are .80 or higher which, as noted by Lance, 
Butts, and Michels (2006), “appears to be Nunnally’s (1978) 
recommended reliability standard for the majority of purposes 
cited in organizational research” (p. 206). Accordingly, it is un- 
likely that differential predictor reliability would be so large as to 
completely eliminate all differential prediction variability if it 
exists. Nevertheless, if the College Board makes these data avail- 
able in the future, analyses considering sample-level measurement 
error will be possible. 

Finally, there are additional factors that could account for ob- 
served variability in differential prediction across samples. Specif- 
ically, some of these factors include unequal number of test takers 
across groups (i.e., women vs. men, Blacks vs. Whites, Hispanics 
vs. Whites); subgroup mean differences regarding the predictors 
SAT-CR, SAT-M, SAT-W, and high-school GPA; and standard 
deviations (SDs) for the predictors (as suggested by Linn, 1983). 
Thus, we included each of these factors in our study. 

Quantifying differential prediction variability. To quantify 
the degree of differential prediction variability across samples, we 
conducted a formal test using Cochran’s Q statistic. Q is a statistic
for evaluating the degree to which regression coefficients differ 
across samples and is computed by summing the squared devia- 
tions of each study’s regression coefficient estimate from the 
overall meta-analytic estimate and weighting each study’s contri- 
bution by its sample size. Hence, a statistically significant Q 
suggests the presence of heterogeneity beyond what is expected by 
chance (Aguinis & Pierce, 1998; Aguinis et al., 2008). In addition, 
we also conducted a variance decomposition analysis and report 
the percent of cross-sample variance that remains after sampling 
error; range restriction; proportion of test takers across ethnicity- 
and gender-based subgroups, subgroup mean differences on the 
predictors (i.e., SAT-CR, SAT-M, SAT-W, and HSGPA); and SDs 
for the predictors have been accounted for as possible sources of 
variance. 
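
The Q computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, argument names, and example numbers are hypothetical. The text describes weighting by sample size; inverse sampling variances are the other common choice of weight and are the form for which Q is usually referred to a chi-square distribution with k - 1 degrees of freedom.

```python
from typing import Sequence, Tuple


def cochran_q(coefs: Sequence[float], weights: Sequence[float]) -> Tuple[float, float, int]:
    """Return (weighted mean, Q, degrees of freedom) for one regression coefficient.

    coefs   -- one coefficient estimate per sample (hypothetical inputs)
    weights -- one positive weight per sample, e.g., sample sizes as in the
               text, or inverse sampling variances (an assumption here)
    """
    if len(coefs) != len(weights) or len(coefs) < 2:
        raise ValueError("need at least two samples with matching weights")
    total_w = sum(weights)
    # Weighted (meta-analytic) mean of the coefficient across samples.
    mean = sum(w * b for w, b in zip(weights, coefs)) / total_w
    # Weighted sum of squared deviations from that mean.
    q = sum(w * (b - mean) ** 2 for w, b in zip(weights, coefs))
    return mean, q, len(coefs) - 1


# Illustrative values only (not taken from the study).
mean_b, q_stat, df = cochran_q([0.10, 0.20, 0.30], [100, 200, 100])
print(mean_b, q_stat, df)
```

A large Q relative to its degrees of freedom points to more spread in the coefficients than sampling error alone would be expected to produce.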

Implementing differential prediction generalization 
analysis. We conducted the following steps. First, we computed
unstandardized regression coefficients, b_i, from Equation 3 for
each institution. Then, we corrected the variance-covariance matrix
for b_i for range restriction using Equations 7 and 8. We
implemented the MMA procedure as in Equation 5 for two models.
Model 1 used b_i and Cov(b_i | X_i) as discussed earlier as input for
the MMA procedure. For Model 1 there were no sample-level
variables included (i.e., W = J, and W, = 0). Model 2 extended
Model 1 by including the following sample-level predictors into
W_i: inverse of sample size, proportion of test takers in reference
group, subgroup mean differences regarding predictors (i.e., three
SAT tests and HSGPA), and sample-level SDs for the four predictors.
In the Results section, S_b refers to the standard deviation
of unstandardized regression coefficients from the meta-analyzed
mean coefficients. Furthermore, we also estimated S_δ, which denotes
the estimated SD of random effects (δ_i in Equation 5 for
Model 2). We implemented the differential prediction generalization
analysis with R (R Core Team, 2014) using the mvmeta
(Gasparrini, Armstrong, & Kenward, 2012) and mvtmeta (Chen,
2012) packages.
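
To make the idea of an estimated SD of random effects concrete, the sketch below applies a univariate DerSimonian-Laird moment estimator to a single coefficient. This is a deliberate simplification and an assumption on our part: the analysis described above is multivariate and was run with the R packages named in the text, whose estimation method may differ. The function name, inputs, and example numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def random_effects_sd(coefs: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Return (pooled coefficient, estimated SD of random effects).

    coefs     -- one coefficient estimate per sample (hypothetical inputs)
    variances -- sampling variance of each estimate (must be positive)
    Uses the univariate DerSimonian-Laird moment estimator, which is a
    stand-in for, not a copy of, the multivariate procedure in the text.
    """
    k = len(coefs)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two samples with matching variances")
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, coefs)) / sw
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, coefs))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-sample variance, truncated at zero when Q is below its expectation.
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * b for wi, b in zip(w_star, coefs)) / sum(w_star)
    return pooled, math.sqrt(tau2)


# Illustrative values only (not taken from the study).
pooled_b, sd_random = random_effects_sd([0.0, 0.2, 0.4], [0.01, 0.01, 0.01])
print(pooled_b, sd_random)
```

An SD of zero would indicate that the coefficient varies across samples no more than sampling error predicts; larger values indicate systematic differences across samples.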


Similarities and Differences in Data-Analytic 
Approach Between Mattern and Patterson (2013) 
and Present Study 


We implemented the same range restriction correction as Mat- 
tern and Patterson that was described previously. However, there is 
an important difference between the data-analytic approach em- 
ployed by Mattern and Patterson compared with our study. Spe- 
cifically, our study implemented a novel differential prediction 
generalization analysis based on the multivariate meta-analytic 
regression modeling approach recommended by Becker and Wu 
(2007), who provided a detailed discussion concerning the merits 
of different approaches for meta-analyzing regression coefficients. 
We followed their recommendation because this approach consid- 
ers the size of each sample explicitly and the effects of other 
factors (i.e., range restriction; proportion of students across 
ethnicity- and gender-based subgroups; subgroup mean differences 
for the predictors HSGPA, SAT-CR, SAT-M, and SAT-W; and 
SDs for the predictors) and, therefore, allows us to understand the 
extent to which observed variability in differential prediction is 
substantive or because of methodological and statistical artifacts. 


Results 


Corroboration of Mattern and Patterson 
(2013) Results 


We first attempted to corroborate Mattern and Patterson’s re- 
sults based on multiple regression correlations (i.e., square root of 
R^2) for models with different subsets of the predictors and different
types of corrections. This corroboration was necessary prior to our 
substantive analysis assessing differential prediction generaliza- 
tion to confirm the integrity of the database and that our differen- 
tial prediction assessment procedure is identical to the one imple- 
mented by Mattern and Patterson. 

Table 1 includes the multiple correlations reported by Mattern 
and Patterson (2013) based on observed (i.e., uncorrected) scores 
(R_obs), multiple correlation based on models using Lawley’s correction
for predictor and criterion range restriction (Rep), multiple
correlation based on models correcting for predictor and criterion 
range restriction and criterion measurement error (Reryp), and 
multiple correlation based on models correcting for predictor and 
criterion range restriction and predictor and criterion measurement 
error using an errors-in-variables model (i.e., p; Culpepper, 2012a; 
Culpepper & Aguinis, 2011). Table 1 includes several types of
hierarchical regressions and all analyses are based on centered 
continuous predictors. For example, “T’, “A” under “MP Table 2” 
corresponds to the FM comparison in Mattern and Patterson for a 
model that only includes SAT scores. In contrast, pies C:s1sea 
model that includes SAT and HSGPA variables, a gender reference 
variable, and all product terms between the categorical and con- 
tinuous variables. Results shown in Table 1 indicate that the 
corroborated results are within minimal rounding error of Mattern 
and Patterson’s results at each stage and after the implementation 
of each type of correction. Consequently, Table 1 provides evi- 
dence that the data, equations, and procedures we used to assess 
differential prediction are identical to those used by Mattern and 
Patterson. 

Despite our ability to reproduce results, we found a few dis- 
crepancies that are likely typographical errors in Mattern and 
Patterson (2013) for the model including range restriction and 
criterion measurement error correction. In fact, we detected this 
same inconsistency in the Mattern and Patterson article for the FM, 
BW, and HW comparisons, which is highly improbable given that 
correcting for range restriction should lead to multiple correlation 
coefficients that are different from those based on observed data 
(e.g., Berry et al., 2013). In short, the only difference between our 
results and Mattern and Patterson’s is that they may have mistak- 
enly repeated the label “none” and copied the incorrect results in 
their Table 3. This discrepancy does not affect the differential
prediction generalization results and conclusions reported herein 
because our analyses are based on their data and not results they 
reported in their Table 3.


Differential Prediction Analysis 


Table 2 reports range restriction corrected differential prediction 
results for the FM, BW, and HW comparisons (i.e., results from 
Model 1). Specifically, the EST column shows average (i.e., meta- 
analyzed) coefficients across samples. Results for the coefficients 
in Table 2 indicate small differences for the simple slope coeffi- 
cients for the SAT subtests for the BW and HW comparisons. 
Also, coefficients reported in Table 2 provide evidence that the 
SAT-CR and SAT-M tests were more strongly related to college 
GPA for women in comparison to men. Table 2 also provides 
evidence of subgroup differences in intercepts across the three 
subgroup comparisons, as has been shown in the past. That is, 
women scored, on average, 0.15 grade points higher than men 
whereas Blacks and Hispanics earned GPAs that were, on average, 
0.19 and 0.10 points lower than Whites, respectively. These re- 
sults, which represent the average degree of differential prediction 
for slopes and intercepts across samples for the FM, BW, and HW 
comparisons are consistent with previous studies (e.g., Fischer et 
al., 2013; Mattern & Patterson, 2013). 


Graphic Representation of Differential Prediction 
Across Samples 


Prior to conducting differential prediction generalization analysis,
we calculated differences in predicted GPA values, symbolized
by ΔŶ, for the FM, BW, and HW comparisons and present results
in Figure 1. This figure offers a visual display of the variability of
differential prediction across samples and plots the individual lines
for each sample to provide a graphical representation of the regression
coefficients that were modeled in the metaregression
procedure (i.e., coefficients prior to corrections). In calculating
values for ΔŶ for each predictor, the other predictor scores are
assumed to be equal to their means and we plotted ΔŶ between −2
and 1.5 SDs around the predictor average. Thus, for example, for
SAT-M, ΔŶ = Δβ_0 + Δβ_1 SAT-M. The panels in Figure 1 include
not only the aggregated degree of differential prediction across all
samples (i.e., central tendency), but also the individual lines for
each sample to provide an indication of dispersion across samples.

Figure 1 shows variability in subgroup prediction line differ- 
ences prior to adjusting for statistical and methodological artifacts. 
Furthermore, Figure 1 shows that the direction of slope differences
varies and that there are many samples for which GPA is either 
over- or underpredicted by as much as 0.25 on a 0 to 4.0 grade 
point scale and, in some cases, by 0.50 in the tails of predictor 
score distributions. 

For pedagogical and illustrative purposes, Figure 2 plots the
difference between predicted GPA values across subgroups, symbolized
by ΔŶ, for four prototypal scenarios to aid the interpretation
of various types of differential prediction based on intercept
and slope differences. Similar to Figure 1, Figure 2 plots ΔŶ prior
to corrections for sample-level variables. Also similar to results
plotted in Figure 1, for a given standardized predictor, z (i.e.,
HSGPA or SAT tests), ΔŶ = Δβ_0 + Δβ_1 z, where Δβ_0 and Δβ_1 are
intercept and slope differences, respectively, between the reference
group coded as 0 (i.e., White, male) and the comparison group
coded as 1 (i.e., ethnic minority, female). These illustrations are
not average in terms of the amount and direction of differential
prediction but, rather, exemplary for a considerable number of
samples. Also, to make comparisons easier, we used the same axis
scales as in Figure 1.

First, consider Institution #61 in 2006 for the BW SAT-CR
comparison, for which subgroup prediction equations are nearly
equivalent (i.e., Δβ_0 = 0.006 and Δβ_SAT-CR = 0.000). The ΔŶ
plot for Institution #61 is similar to a horizontal line with ΔŶ =
0 for all values of z. Consequently, the plot for this institution is
representative of those that include subgroups with similar intercepts
and slopes. Next, consider Institution #136 in 2007 for the
HW SAT-M comparison, for which the Hispanic intercept is
approximately 0.20 units smaller than that of the White group (i.e.,
Δβ_0 = −0.198 and Δβ_SAT-M = 0). The ΔŶ plot for Institution
#136 is horizontal, which indicates the absence of subgroup slope
differences; however, ΔŶ is vertically shifted to the point where
ΔŶ = −0.198. In contrast, Institution #169 in 2008 for the HW
SAT-W comparison includes subgroups that differ in slopes, but
not intercepts, where Δβ_0 = 0.014 and Δβ_SAT-W = 0.004. The
extent to which institutions differ in slopes can be identified by the
degree to which ΔŶ deviates from a horizontal line. For instance,
Institution #169 has a ΔŶ plot with a positive slope that passes


3 Corrections for range restriction and criterion measurement error affect
R^2 values but, as noted above, they do not alter estimates of regression
coefficients. The difference in R^2 values between uncorrected and corrected
models arises because the Lawley procedure corrects the
criterion variance and the correction for criterion measurement error divides
the uncorrected R^2 values by the root of the criterion reliability coefficient.
Table 2

Range Restriction Corrected Results of Differential Prediction Analysis for Female–Male, Black–White, and Hispanic–White
Comparisons Using Meta-Analytic Regression Modeling

                               Female–Male                    Black–White                  Hispanic–White
Variable               EST     SE     Sig.   S_δ      EST     SE     Sig.   S_δ      EST     SE     Sig.   S_δ
HSGPA                  .4394   .0066   *     .1073    .4635   .0079   *     .1153    .4548   .0076   *     .1130
SAT-CR                 .0005   .0000   *     .0003    .0004   .0000   *     .0003    .0005   .0000   *     .0003
SAT-M                  .0008   .0000   *     .0004    .0004   .0000   *     .0003    .0004   .0000   *     .0004
SAT-W                  .0012   .0000   *     .0002    .0014   .0000   *     .0003    .0014   .0000   *     .0003
Reference              .1587   .0035   *     .0521   −.1883   .0078   *     .0919   −.1043   .0063   *     .0740
HSGPA × Reference     −.0511   .0052   *     .0605   −.1388   .0114   *     .1400   −.0818   .0106   *     .1247
SAT-CR × Reference     .0002   .0000   *     .0002    .0000   .0001         .0008    .0000   .0001         .0006
SAT-M × Reference      .0003   .0000   *     .0003    .0001   .0001         .0007    .0001   .0001         .0007
SAT-W × Reference     −.0001   .0000         .0002   −.0001   .0001         .0009   −.0001   .0001         .0008

Note. EST = fixed-effects coefficients; Sig. = significance; S_δ = standard deviation of random effects (δ_i in Equation 5). Criterion for all
models: first-year college grade point average (GPA). Predictors: HSGPA: High school grade point average, SAT-CR: SAT Critical Reading,
SAT-M: SAT Math, SAT-W: SAT Writing. Reference: Dummy variable representing subgroups and coded as 1 for women and 0 for men
(Female–Male comparison), 1 for Black and 0 for White (Black–White comparison), and 1 for Hispanic and 0 for White (Hispanic–White comparison).
* Statistically significant coefficient.



through the (0,0) point. Furthermore, we see that intercept differences
are zero in Institution #169 by noting that the value of ΔŶ
when z = 0 is zero. The fourth scenario, which refers to Institution
#103 in 2007 for the BW SAT-M comparison, shows groups that
differ in intercepts and slopes. For Institution #103, Blacks have a


Figure 1. Variability in differential prediction across 348 samples of
students in 176 colleges and universities. ΔŶ scores show differences
between predicted first-year grade point average (GPA) scores across 
ethnicity- and gender-based subgroups based on models with scores cor- 
rected for range restriction. SAT-CR: SAT Critical Reading, SAT-M: SAT 
Math, SAT-W: SAT Writing, HSGPA: high school grade point average. 
The coloring indicates number of samples that overlap in subgroup pre- 
diction equation differences. FM: female versus male, BW: Black versus 
White, and HW: Hispanic versus White comparisons. The x-axes show 
predictor scores (i.e., HSGPA, SAT-CR, SAT-M, and SAT-W) and the 
x- and y-axes show scores in SD units. 


smaller intercept (Δβ_0 = −0.505) and slope (Δβ_1 = −0.003).
Figure 2 shows that, for Institution #103, ΔŶ is a downward
sloping line indicating negative group differences in intercepts and
slopes.
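
The four scenarios can be traced numerically with a short sketch of the relation stated in the text: the difference in predicted GPA equals the intercept difference plus the slope difference times z. The coefficient values are those reported for the four institutions; the function name and the grid of z values are hypothetical. The sketch assumes the slope differences are expressed on the same scale as z, which the text does not state explicitly.

```python
from typing import Dict, List, Tuple


def delta_y_hat(d_intercept: float, d_slope: float, z: float) -> float:
    """Difference in predicted GPA between comparison and reference group at z."""
    return d_intercept + d_slope * z


# (intercept difference, slope difference) as reported for each institution.
SCENARIOS: Dict[str, Tuple[float, float]] = {
    "61 (BW, SAT-CR)": (0.006, 0.000),
    "136 (HW, SAT-M)": (-0.198, 0.000),
    "169 (HW, SAT-W)": (0.014, 0.004),
    "103 (BW, SAT-M)": (-0.505, -0.003),
}

# Hypothetical grid of standardized predictor values.
Z_GRID: List[float] = [-2.0, -1.0, 0.0, 1.0, 2.0]

for label, (d0, d1) in SCENARIOS.items():
    line = [round(delta_y_hat(d0, d1, z), 3) for z in Z_GRID]
    print(label, line)
```

A flat line near zero corresponds to no differential prediction, a flat line away from zero to an intercept difference only, and a tilted line to a slope difference.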


Pervasiveness of Differential Prediction 
Across Samples 


Figure 2 includes actual yet illustrative scenarios only. Accord- 
ingly, Table 3 includes more comprehensive information regarding 
the pervasiveness of differential prediction across samples. Spe- 
cifically, Table 3 shows the percent of samples with intercept and 
slope differences different from zero for the three subgroup com- 
parisons. We did not implement Bonferroni-type corrections to 
minimize a possible Type I error inflation because product terms 
capturing the interactions are correlated and such correction would 
result in overly conservative tests given the known insufficient 
statistical power in differential prediction analysis (Aguinis, Cul- 
pepper, et al., 2010; Bobko & Russell, 1994; Cronbach, 1987; 
McClelland & Judd, 1993). Moreover, as noted by Mattern and 
Patterson (2013), “Although the overall sample size was quite 
large, the average sample size per study was substantially smaller” 
(p. 142). Specifically, the average subgroup sample sizes were 
approximately 120 for African Americans and 135 for Hispanics, 
which are not uncommonly large (e.g., Aguinis & Stone-Romero, 
1997). Much larger sample sizes are needed to achieve satisfactory 
statistical power (Aguinis, 2004a; Aguinis, Boik, & Pierce, 2001). 

Table 3 shows that gender-based (i.e., FM) differential predic- 
tion occurred for slopes in 8.3%, 16.2%, and 4.1% of the samples 
for SAT-CR, SAT-M, and SAT-W, respectively. Considering re- 
sults for the SAT-M, given that the FM comparison was based on 
a total of 477,769 students, approximately 77,399 (i.e., 16.2% of 
the total) attended an institution where SAT-M differentially pre- 
dicted first-year college grades based upon gender. Black-White 
differences for slopes for HSGPA, SAT-CR, SAT-M, and SAT-W 
occurred in 39.7%, 19.4%, 13.4%, and 16.2% of the samples, 
which amounts to approximately 132,640, 64,817, 44,770, and 
54,125 students out of a total of 334,106, respectively. In addition, 



Figure 2 legend (institution, comparison, coefficient differences):
61, BW, Δβ_0 = 0.006, Δβ_SAT-CR = 0
136, HW, Δβ_0 = −0.198, Δβ_SAT-M = 0
169, HW, Δβ_0 = 0.014, Δβ_SAT-W = 0.004
103, BW, Δβ_0 = −0.505, Δβ_SAT-M = −0.003
x-axis: Standardized Predictor (−2 to 2)


Figure 2. Prototypical scenarios based on actual samples showing no differential
prediction and three forms of differential prediction. Institution #61: no
differential prediction, Institution #136: differential prediction based on
intercepts but not slopes, Institution #169: differential prediction based on
slopes but not intercepts, Institution #103: differential prediction based
on intercepts and slopes, ΔŶ: subgroup-based differences in predicted
criterion value (i.e., first-year college grade point average [GPA]), Δβ_0 =
subgroup-based differences in intercepts, and Δβ_1 = subgroup-based differences
in slopes. SAT-CR: SAT Critical Reading, SAT-M: SAT Math,
SAT-W: SAT Writing. FM: female versus male, BW: Black versus White,
and HW: Hispanic versus White comparisons. x- and y-axes show scores in
SD units.


there were HW differences for HSGPA and the SAT subtests in 
25.0%, 13.3%, 18.9%, and 15.5% of the samples, respectively, 
which suggests that approximately 86,125, 45,818, 65,110, and 
53,397 students attended institutions where there is Hispanic— 
White differential prediction (out of a total of 344,499 students). 
Finally, Table 3 shows that differential prediction based on inter- 
cepts is even more pervasive: 80.8%, 61.9%, and 41.3% of sam- 
ples for the FM, BW, and HW comparisons, respectively. In other 
words, there is differential prediction for the vast majority of 
samples for the FM comparison, for more than half for the BW 
comparison, and for just under half for the HW comparison. 
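
The approximate student counts quoted in this section follow from multiplying the share of samples showing a slope difference by the total number of students in the comparison. A minimal sketch of that arithmetic is shown below; the function name is hypothetical. Note that this treats samples as equal in size, whereas the N column of Table 3 sums the actual sizes of the flagged samples, so the two sets of counts need not agree.

```python
def approx_students(share_of_samples: float, total_students: int) -> int:
    """Approximate number of students in samples showing differential prediction."""
    return round(share_of_samples * total_students)


# Totals and shares as reported in the text.
FM_TOTAL, BW_TOTAL, HW_TOTAL = 477_769, 334_106, 344_499

print(approx_students(0.162, FM_TOTAL))  # FM, SAT-M slopes
print([approx_students(p, BW_TOTAL) for p in (0.397, 0.194, 0.134, 0.162)])
print([approx_students(p, HW_TOTAL) for p in (0.250, 0.133, 0.189, 0.155)])
```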


Differential Prediction Generalization Analysis 


Going beyond the reporting of the average degree of differential 
prediction across samples, Table 2 also includes the square root of 
the estimated SD of random effects for the nine regression coefficients
for the three comparisons (i.e., the column labeled as
“S_δ”). S_δ quantifies the extent of systematic differences in differential
prediction across samples.

To assess the degree of differential prediction variability across 
samples, Table 4 includes results of a formal test pertaining to 
differential prediction generalization using Cochran’s Q statistic. 
Recall that a statistically significant Q test suggests the presence of
heterogeneity beyond what is expected by chance (Aguinis &
Pierce, 1998; Aguinis, Sturman, & Pierce, 2008). Table 4 includes
results for Model 1, which includes the nine predictor variables
(i.e., five first-order effects and four product terms), and for Model
2, which includes Model 1 and the following additional sample- 
level predictors: inverse of sample size (to account for sampling 
error), proportion of test takers in reference group (i.e., to account 
for differences in the size of samples across ethnicity- and gender- 
based subgroups), subgroup mean differences regarding predictors 
(i.e., three SAT tests and HSGPA), and sample-level SDs for the 
four predictors. Results in Table 4 show that 13 out of the 15 Q 
tests are statistically significant. The only two statistically nonsig- 
nificant tests were the FM comparison for the SAT-W and 
SAT-CR tests. In other words, results in Table 4 indicate that (a) 
differential prediction based on HSGPA, SAT-CR, SAT-M, and 
SAT-W does not generalize for the BW and HW comparisons; (b) 
differential prediction based on HSGPA and SAT-M does not 
generalize for the FM comparison, and (c) there is differential 
prediction generalization based on the SAT-CR and SAT-W for 
the FM comparison. 

In addition to Q statistics, Table 4’s column labeled % shows the
percent of variance in coefficients across samples that remains
after accounting for methodological and statistical artifacts (i.e.,
variance decomposition based on S_δ values from Model 2). More
precisely, the rows for “Reference” show the percent of intercept- 
based differential prediction variance across samples remaining 
after accounting for methodological and statistical artifacts and the 
rows pertaining to two-way interactions show the percent of slope- 
based differential prediction variance across samples remaining 
after accounting for methodological and statistical artifacts. These 
results offer additional information about the extent of variability 
(i.e., degree of lack of generalization) for each test and subgroup 
comparison. Lack of differential prediction generalization was 
greatest for HSGPA for the BW comparison (about 34% of vari- 
ance in coefficients across samples remains after methodological 
and statistical artifacts are taken into account), followed by the 
intercept for the BW and HW comparisons (about 29% of variance 
remaining for each), HSGPA for the HW comparison (about 28% 
of variance remaining), SAT-W for the BW comparison (about 
20% of variance remaining), SAT-M for the HW comparison 
(about 19% of variance remaining), and the intercept for the FM 
comparison (also about 19% of variance remaining). Alternatively, 
for the SAT-W, only about 3% of variance in differential predic- 
tion across samples remains after artifacts are taken into account 
for the FM comparison. 
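
One way to read the percentages above is as a ratio of variances: the random-effects variance that is left once sample-level predictors are included, divided by the total variance of the coefficient across samples. The sketch below encodes that reading. The formula, function name, argument names, and example numbers are all assumptions for illustration; the source does not give the exact expression used.

```python
def percent_variance_remaining(sd_total: float, sd_residual: float) -> float:
    """Percent of cross-sample coefficient variance left after artifacts.

    sd_total    -- SD of a coefficient across samples before adjustment
    sd_residual -- SD of random effects after sample-level predictors are added
    The ratio-of-variances formula is an assumption, not taken from the source.
    """
    if sd_total <= 0 or sd_residual < 0:
        raise ValueError("sd_total must be positive and sd_residual non-negative")
    return 100.0 * (sd_residual ** 2) / (sd_total ** 2)


# Illustrative values only (not taken from the study).
remaining = percent_variance_remaining(sd_total=0.10, sd_residual=0.05)
print(f"{remaining:.1f}% of variance remains")
```

Under this reading, a value near zero means the artifacts account for nearly all of the observed variability, and larger values mean more of it is left unexplained.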


Discussion 


Our results reveal that the conclusion that “findings indicated 
that the use of SAT and HSGPA results in minimal differential 
prediction” (Mattern & Patterson, 2013, p. 146) is only reached 
when we examine summary statistics collapsing across the 348 
samples collected from the 176 colleges and universities. In con- 
trast, differential prediction generalization analysis suggests that 
there is substantial variability in differential prediction across 
samples. In fact, subgroup differences in intercepts and slopes are 
quite large for many colleges and universities and sample-level 
variability remains after accounting for sampling error and other 
Table 3

Pervasiveness of Range Restriction Corrected Differential Prediction Based on Intercepts and Slopes for Female–Male, Black–White,
and Hispanic–White Comparisons

                       Female–Male                  Black–White                  Hispanic–White
                       (339 samples;                (247 samples;                (264 samples;
                       477,769 students)            334,106 students)            344,499 students)
Variable               %      N         % ≥ TPA     %      N         % ≥ TPA     %      N         % ≥ TPA
HSGPA                  .976   475,287               EOL    332,443               .989   343,627
SAT-CR                 .295   220,203               6397,  175,566               .394   183,023
SAT-M                  .490   344,011               .360   161,682               .356   170,312
SAT-W                  .605   399,730               .834   317,429               .837   328,262
Reference              .808   450,604               .619   259,959               .413   194,088
HSGPA × Reference      .224   143,715   .024        .397   157,899   .093        .250   S715      .080
SAT-CR × Reference     .083   60,395    .425        .194   68,264    . 24        .133   955167)   .667
SAT-M × Reference      .162   91,421    .345        .134   52,285    .700        .189   60,086    .659
SAT-W × Reference      .041   15,982    .192        .162   69,955    .356        .155   94,732    D2

Note. Criterion for all models: first-year college grade point average (GPA). Predictors: HSGPA = high school grade point average; SAT-CR = SAT
Critical Reading; SAT-M = SAT Math; SAT-W = SAT Writing. Reference: Dummy variable representing subgroups and coded as 1 for women and 0
for men (Female–Male comparison), 1 for Black and 0 for White (Black–White comparison), and 1 for Hispanic and 0 for White (Hispanic–White
comparison). % = percentage of samples showing individual regression coefficients different from zero (p < .05); N = number of students based on
summing sample sizes of samples with coefficients different from zero; % ≥ TPA = percent of samples with a differential prediction effect as large as
or larger than the test’s predictive ability (i.e., reference group slope) regardless of statistical significance. All values are computed using the model in
Equation 3.



methodological and statistical artifacts that could potentially in- 
flate observed differential prediction variability (i.e., range restric- 
tion, proportion of test takers across ethnicity- and gender-based 
subgroups, subgroup mean differences on the predictors, and SDs 
for the predictors). The finding regarding overall lack of differen- 
tial prediction generalization is new because past research has only 
provided evidence regarding validity generalization (i.e., Boldt, 
1986a, 1986b; Linn et al., 1981), but not regarding differential 
prediction generalization (or lack thereof). The Standards for 
Educational and Psychological Testing note that “validity refers to 
the degree to which evidence and theory support the interpretation 
of test scores for proposed uses of tests” (American Educational 
Research Association, American Psychological Association, and 
National Council on Measurement in Education, 2014, p. 11). 
Accordingly, the result regarding overall lack of differential pre- 
diction generalization also has implications for validity because 
knowledge that differential prediction does not generalize requires 
interpretations of test scores within local contexts. 


Implications for Theory and Future Research 


Aggregating results based on samples for which there is overprediction
for one subgroup and samples for which there is underprediction
for the same subgroup leads to the conclusion that,
across samples, differential prediction is virtually nonexistent. The 
British writer and politician Benjamin Disraeli (1804–1881) stated
the following (Huff, 1954): “A man eats a loaf of bread, and 
another man eats nothing; statistics is the science that tells us that 
each of these men ate half a loaf of bread.” The same issue of 
aggregation across heterogeneous units—samples of students from 
different colleges and universities in our particular case—explains 
why Mattern and Patterson’s results suggest that differential pre- 
diction is “minimal.” 

The variability in observed differential prediction across sam- 
ples is not explained fully by sampling error and other method- 
ological and statistical artifacts that have accounted for the major- 


ity of variance in validity coefficients across studies in past 
research. Specifically, the lack of differential prediction general- 
ization is not explained by criterion measurement error, range 
restriction, proportion of test takers in reference group, predictor 
SDs, and subgroup mean differences regarding predictors (i.e., 
SAT-CR, SAT-M, SAT-W, and high-school grade point average). 
For the FM comparison, HSGPA and SAT-M show the greatest 
lack of differential prediction generalization. For the BW compar- 
ison, HSGPA also shows the greatest lack of differential prediction 
generalization, followed by SAT-W, SAT-CR, and SAT-M. For 
the HW comparison, the greatest lack of differential prediction 
generalization was also observed for HSGPA, followed by 
SAT-M, SAT-W, and SAT-CR. 

Taken together, results suggest that, as is the case in many areas 
in educational and organizational research (Rousseau, 1978), con- 
text should play an important role in future college admissions 
testing research. In particular, future research can investigate 
cross-level interaction effects (Aguinis et al., 2013; Mathieu, Agui- 
nis, Culpepper, & Chen, 2012). Specifically, as mentioned in the 
Introduction, there are institution-level variables (i.e., Level 2 
moderators) that likely affect the relationship between individual- 
level test scores and performance (i.e., a relationship between a 
level-one predictor and a level-one criterion). For example, why is 
it that for some contexts and tests there are prediction differences 
in favor of Black students whereas for others the opposite is true? 
Mattern and Patterson (2013) took the first and unprecedented step 
to make a substantial amount of data available, but their data did not 
include information on substantive institution-level factors. We hope the 
College Board and other test vendors, not only of college admissions 
tests but also employee selection tests, will make institution-level 
data available so that future research will be able to answer this 
and other related critical questions. In other words, we currently do 
not know which institution-level factors cause differential predic- 
tion, and which particular form of differential prediction, across 
contexts. Given our results, there is a need for future research to 
examine factors causing differential prediction to vary in magnitude
and direction across contexts. Results of this research will
likely lead to effective actions and interventions. To guide future 
research, we offer a more detailed description of how and why 
each of the mechanisms we listed in the Introduction may serve as 
possible explanations for the presence of differential prediction 
and differential prediction variability across institutions. 

Stereotype threat. Stereotype threat is a situational phenom- 
enon that occurs when individuals believe they face the prospect of 
being evaluated as a function of, and confirming, a negative 
stereotype about a group to which they belong (Steele & Aronson, 
1995). According to Walton et al. (2015), standardized cognitive 
ability tests can induce stereotype threat among test takers who are 
members of underrepresented groups (e.g., women, members of 
ethnic minority groups). Referred to as the “latent-ability” hypoth- 
esis, stereotype threat can prevent such test takers from performing 
as well as they are capable; that is, some of their cognitive ability 
remains latent or hidden. Hence, test scores can show systematic
differential prediction such that they underestimate the ability and 
potential performance of individuals from negatively stereotyped 
groups (Walton & Spencer, 2009). Walton et al. (2015) concluded 
that stereotype threat can affect ethnic minorities’ scores on cog- 
nitive ability tests administered in evaluative settings (e.g., 
schools) and, thus, result in disproportionately negative effects on 
decisions regarding their selection. The magnitude of the effect of 
stereotype threat on differential prediction may, however, depend 
on the degree to which the threat affects predictor and criterion 
scores differentially across ethnicity-based subgroups (Brown & 
Day, 2006). In short, differential levels of stereotype threat are 
likely to lead to differential levels of differential prediction across 
institutions. 

Lack of common cultural frame of reference and identity 
across groups. Members of different ethnicity-based subgroups 
do not share a common cultural frame of reference and identity 
(Ogbu, 1993). For example, ethnic minority group members may 
interpret discrimination against them as permanent and institution- 
alized. This frame of reference develops over long periods of time 
as the result of perceived or actual exclusion, segregation, and 
barriers to opportunities. It can make some ethnic minority group 
members have lower expectations about the likelihood that obtain- 
ing good test scores will lead to desirable outcomes such as 
admission to college (Gould, 1999). Stated differently, cultural 
frames of reference affect how tests and testing situations are 
interpreted. Hence, ethnicity-based subgroups differ in their inter- 
pretation of the meaning of test scores and the relation between test 
scores and performance measures (Grubb & Ollendick, 1986). 
Such ethnicity-based differences in cultural frames likely differ 
across contexts and institutions and, therefore, are another factor 
that likely leads to differential levels of differential prediction. 

Leniency effects favoring one group over another. With 
respect to college students’ grades and their GPA, leniency effects 
can occur when graders apply a “shifting standards” model and 
assign some minority students higher grades than they deserve 
(Berry et al., 2013). The resulting error variance in some minority 
students’ GPA can affect the relation between cognitive ability test 
scores and GPA. Because this shifting of standards is unlikely to 
be homogeneous across institutions, it is another contextual factor
likely to create variability in the degree of differential prediction
across institutions. 

Differential recruiting, mentoring, and retention interven- 
tions across groups. To meet affirmative action goals, many 
academic institutions make extra efforts to recruit, mentor, and 
retain ethnic minority students—this is also the case regarding 
women in fields in which they are underrepresented (e.g., STEM: 
science, technology, engineering, and math). These extra efforts 
could include using different admissions standards, offering extra 
tutoring, and providing counseling opportunities while in college 
(Berry et al., 2013). According to Berry et al. (2013), if institutions 
implement these efforts, then students’ admission into and success 
in college can be a function of factors other than their cognitive 
ability, which could reduce the relation between their cognitive 
ability test scores and GPA. Because such efforts clearly differ 
across institutions, they could also be a factor leading to different
degrees of differential prediction. 

Differential course difficulty across groups. Finally, differ- 
ential prediction may also be explained, at least in part, by differ- 
ential course difficulty across gender- or ethnicity-based sub- 
groups. For example, Berry and Sackett (2009) determined that 
differential course difficulty may explain differences regarding 
GPA scores and, moreover, this phenomenon may lead to a de- 
crease in the resulting validity coefficient. Because differences in 
course difficulty are unlikely to be homogeneous across institutions,
it is also unlikely that the degree of differential prediction is 
homogeneous across institutions. 

More broadly, there are additional issues regarding the use of 
GPA as the criterion that may lead to differential prediction 
variability across institutions. For example, these include differ- 
ential course selection, drop-out rates, and institutional selection 
criteria at the local level, among others. As summarized by Berry 
and Sackett (2009), “College GPA certainly reflects academic 
performance to some degree, but there are also well-known 
sources of construct-irrelevant variance in GPA—particularly
instructors’ grading idiosyncrasies . . .” (p. 822). Hence, these and
other idiosyncrasies associated with a student’s GPA, which are 
likely to vary across institutions, may account, at least in part, for 
the lack of differential prediction generalization found in our 
study. 


Implications for Practice 


Results regarding overall lack of differential prediction gener- 
alization imply that SAT scores and HSGPA seem to function 
differently across some subgroups and institutions in predicting 
first-year college GPA. These results have important implications 
for practice given that, since 2005, between 1.4 million and 1.6 
million students have taken the SAT annually, more than 1.66 
million students have done so in the class of 2012 (College Board, 
2013), and about 1.7 million students have taken it during the year 
2013 (Lewin, 2014). 

Results included in Table 3, and our earlier discussion, provide 
evidence regarding the pervasiveness of differential prediction. 
However, to gain a fuller understanding of practical significance, 
it is also important to consider the magnitude of the effect (Agui- 
nis, Werner, et al., 2010). Table 3 includes the percentage of 
samples with subgroup slope differences that exceed the magnitude
of the test’s predictive ability for the reference group (i.e.,
slopes between predictors and criterion). For instance, the reference
group slope for SAT-CR was smaller than the subgroup
differences in 42.5% of FM, 72.1% of BW, and 66.7% of HW 
comparisons. In contrast, Table 3 shows that fewer than 10% of 
samples had reference group slopes for HSGPA that were less in 
magnitude than the subgroup difference. 
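
The comparison described above can be illustrated with a minimal sketch. It assumes each sample contributes one reference-group slope and one subgroup slope difference. The function name and the numeric values are hypothetical and are not data from the study.

```python
from typing import Sequence


def share_exceeding_reference(ref_slopes: Sequence[float],
                              slope_diffs: Sequence[float]) -> float:
    """Percentage of samples in which the absolute subgroup slope
    difference is larger than the absolute reference-group slope."""
    if len(ref_slopes) != len(slope_diffs) or not ref_slopes:
        raise ValueError("need two equal-length, non-empty sequences")
    exceed = sum(abs(d) > abs(r) for r, d in zip(ref_slopes, slope_diffs))
    return 100.0 * exceed / len(ref_slopes)


# Made-up values for four samples (illustration only).
ref = [0.30, 0.10, 0.25, 0.05]     # reference-group slopes
diff = [0.10, -0.20, 0.05, 0.08]   # subgroup minus reference slope
pct = share_exceeding_reference(ref, diff)
print(f"{pct:.1f}% of samples")
```

In this made-up example, the slope difference exceeds the reference slope in two of the four samples.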

Although the aforementioned results regarding the prevalence 
and magnitude of differential prediction provide evidence regard- 
ing practical significance, results have important implications even 
if differential prediction were smaller and existed in only a handful 
of samples. The reason is that more than 1.5 million students and 
their families are affected annually by decisions based on students’ 
scores. Moreover, for a test taker whose GPA has been underpre- 
dicted for a desired college because of her ethnicity or his gender, 
it is no consolation that on average, and across institutions, differ- 
ential prediction is minimal. In short, our results regarding prac- 
tical significance show that differential prediction should be taken 
seriously and this is the reason why the Standards for Educational 
and Psychological Testing “emphasize that fairness to all individ- 
uals in the intended population of test takers is an overriding, 
foundational concern, and that common principles apply in re- 
sponding to test-taker characteristics that could interfere with the 
validity of test score interpretation” (American Educational Re- 
search Association, American Psychological Association, and Na- 
tional Council on Measurement in Education, 2014, p. 49). More- 
over, “a fair test does not advantage or disadvantage some 
individuals because of characteristics irrelevant to the intended 
construct . . . characteristics of all individuals in the intended 
population, including those associated with race, ethnicity, gender
. . . must be considered throughout all stages of development,
administration, scoring, interpretation, and use so that barriers to 
fair assessment can be reduced” (American Educational Research 
Association, American Psychological Association, and National 
Council on Measurement in Education, 2014, p. 50). 

Our results suggest that lack of differential prediction when 
using HSGPA and SAT tests cannot be assumed in making college 
admissions decisions. Depending on the institution and its local 
practices (e.g., admissions, grading, affirmative action policies), 
and various contextual and societal factors, it is possible that there 
may be differential prediction—and the form of such differential 
prediction is unlikely to be the same across samples. In terms of 
practice, institutions that rely on SAT and HSGPA for admissions 
and other types of decisions (e.g., scholarship allocations) would 
be well served by conducting a local differential prediction study 
to understand whether it exists and its nature. Only through an 
assessment of the presence of differential prediction together with 
future research aimed at understanding the reasons for various 
types of differential prediction will we be able to minimize it and, 
hopefully, eliminate it. Moreover, the finding that there is differ- 
ential prediction may call into question the use of a particular test 
in a particular institution. In short, sample-level variability is too 
substantial to rely on results that are aggregated across institutions 
for determining whether differential prediction exists at any one 
institution. 
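
As a rough illustration of what a local differential prediction study estimates, the sketch below fits a separate least-squares line for a reference group and a focal group and reports the intercept and slope differences. These point estimates match the group and product-term coefficients of a moderated regression with a group dummy. The function names and data are hypothetical, and the sketch includes no significance test, power analysis, or artifact correction.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def differential_prediction(x_ref, y_ref, x_focal, y_focal):
    """Focal-minus-reference differences in intercept and slope."""
    a_ref, b_ref = fit_line(x_ref, y_ref)
    a_foc, b_foc = fit_line(x_focal, y_focal)
    return {"intercept_diff": a_foc - a_ref, "slope_diff": b_foc - b_ref}


# Made-up standardized test scores and first-year GPAs (illustration only).
scores = [-1.0, 0.0, 1.0, 2.0]
gpa_ref = [2.5, 3.0, 3.5, 4.0]       # slope 0.50, intercept 3.0
gpa_focal = [2.55, 2.8, 3.05, 3.3]   # slope 0.25, intercept 2.8
result = differential_prediction(scores, gpa_ref, scores, gpa_focal)
print(result)
```

Nonzero differences in either quantity indicate that a common regression line would over- or underpredict GPA for one of the groups at some score levels.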

One possibility in terms of practice would be to use a specific 
institution-based regression equation in making GPA predictions, 
but there are three important caveats. First, a local differential 
prediction study relies on data from one institution only and, 
consequently, sample size may be small. Accordingly, because of 
a small sample and accompanying insufficient statistical power,
such institution-based differential prediction analysis is likely to 
conclude that there is no differential prediction even if such 
differential prediction exists (Aguinis, Culpepper, et al., 2010). 
Thus, a power analysis is necessary before one can reach a con- 
clusion of no differential prediction with confidence (Aguinis, 
2004a). An additional recommendation is to use data from more 
than one cohort of students—particularly for the case of smaller 
institutions. However, such aggregation requires homogeneity of cohorts
and contextual processes that may account for differential
prediction in a particular institution. Second, even if a local dif- 
ferential prediction study involves adequate statistical power, the 
resulting coefficients are influenced by statistical and methodolog- 
ical artifacts (e.g., sampling error, range restriction). Hence, they 
should be corrected so that the best estimates of population coef- 
ficients are used (Hunter & Schmidt, 2004). Third, one of the five 
anonymous reviewers included in the Journal of Educational 
Psychology review team that evaluated the original and subsequent 
nine revisions of our manuscript commented that the substantive 
factors we described as possible sources of differential prediction 
could be described as “institutional biases.” Hence, this reviewer 
noted that the recommendation about conducting a local 
institutional-level differential prediction analysis might legitimize 
these institutional biases. 
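
To make the second caveat concrete, the sketch below applies one common textbook correction for range restriction, Thorndike's Case II formula for direct restriction on the predictor. This is an assumption chosen for illustration; the correction procedure used in the study may differ. The function name and the input values are hypothetical.

```python
import math


def correct_range_restriction(r_restricted: float,
                              sd_ratio: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    r_restricted: correlation observed in the restricted (admitted) sample.
    sd_ratio: predictor SD in the applicant pool divided by the predictor
              SD in the restricted sample (usually greater than 1).
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(1.0 + r_restricted ** 2 * (sd_ratio ** 2 - 1.0))
    return numerator / denominator


# Made-up example: observed r = .30, applicant SD 1.5 times the admitted SD.
r_corrected = correct_range_restriction(0.30, 1.5)
print(round(r_corrected, 3))
```

When the SD ratio equals 1 (no restriction), the corrected value equals the observed value, which provides a quick sanity check on the formula.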

Regardless of whether an institution-level or other regression 
equation is used, a possible solution to address the existence of 
differential prediction would be to not use a common line and, 
instead, use different regression lines across subgroups. This prac- 
tice used to be fairly typical (Schmidt & Hunter, 2004), possibly 
reflecting practitioners’ belief regarding the existence of differen- 
tial prediction. However, with the passage of the Civil Rights Act 
of 1991, the legal defensibility of this within-group norming has 
come into question and, in fact, it is generally illegal without a 
consent decree (Aguinis, 2004b; Cascio & Aguinis, 2011). Thus,
the current legal context in the United States highlights the ur- 
gency to conduct additional research involving academic- 
practitioner collaborations that will hopefully result in a greater 
understanding of why and how differential prediction occurs. 

Finally, our analyses involved an examination of differential 
prediction by assessing each individual predictor. We followed this 
approach because the goal of differential prediction analysis is to 
understand whether test score-performance relations vary across 
groups—for each test used in the decision making process (Amer- 
ican Educational Research Association, American Psychological 
Association, and National Council on Measurement in Education, 
2014). However, as noted by an anonymous reviewer, “what if,
for a single school, predictive bias is found for 1 or 2 predictors 
(e.g., SAT-CR and SAT-M), but not the other predictors such that 
when the total application score is computed, the bias from 
SAT-CR and SAT-M is virtually cancelled out?” Although this is 
a possibility in some cases, our position is that, based on profes- 
sional standards (i.e., American Educational Research Association, 
American Psychological Association, and National Council on 
Measurement in Education, 2014; Society for Industrial and Or- 
ganizational Psychology, 2003), the goal of differential prediction 
analysis is to understand the role of each test and, therefore, 
differential prediction analysis should be interpreted at the test 
level of analysis. 
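
The reviewer's scenario can be shown with simple arithmetic. The sketch below assumes a linear composite in which each predictor contributes a hypothetical amount of over- or underprediction (in GPA points) for one applicant. All names and values are made up for illustration.

```python
def composite_gap(gaps: dict) -> float:
    """Sum of per-predictor contributions to over- or underprediction."""
    return sum(gaps.values())


# Hypothetical contributions in GPA points for one applicant:
# positive = overprediction, negative = underprediction.
gaps = {"SAT-CR": 0.12, "SAT-M": -0.11, "HSGPA": 0.00}
total = composite_gap(gaps)
largest_single = max(abs(g) for g in gaps.values())
print(f"composite gap: {total:.2f}, largest single-test gap: {largest_single:.2f}")
```

Here the composite gap is close to zero even though two individual tests show sizable gaps, which is why a composite-level check alone would not reveal the test-level pattern.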


Limitations


Our results and conclusions should be interpreted within the 
context of several limitations because of data unavailability that 
we mentioned earlier. Specifically, we were unable to assess the 
potential impact of violating the linearity, MAR, and constant 
variance assumptions. In addition, we were unable to assess the 
potential impact of bias in the criterion scores (i.e., GPA). Finally, 
we were unable to correct for the potential effects of differential 
reliability of predictors across samples. 


Conclusion 


Our introduction of the new concept called differential predic- 
tion generalization, which combines previous work on differential 
prediction and validity generalization, leads to the conclusion that 
the degree and nature of differential prediction vary across sam- 
ples. Such differences remain after some methodological and sta- 
tistical artifacts that affect the observed variance of differential 
prediction across institutions are taken into account. Thus, the lack 
of differential prediction generalization is not because of artifacts 
such as sampling error, criterion measurement error, and range 
restriction. Moreover, our results suggest that hundreds of thou- 
sands of individuals attend institutions for which there is differ- 
ential prediction of first-year GPA and, consequently, scores are 
under- or overpredicted based on a student’s ethnicity and gender
when a common regression line is used to make admissions and 
other decisions. Because predictions of GPA are used by many 
institutions to make admissions, scholarship, and other important 
decisions that affect the lives of students and their families, there 
is an important need for future research aimed at understanding the 
reasons for differential prediction and differential prediction vari- 
ability across institutions. 